* sleeps and waits during io_submit @ 2015-11-28  2:43 Glauber Costa
  2015-11-30 14:10 ` Brian Foster
  2015-11-30 23:10 ` Dave Chinner
  0 siblings, 2 replies; 58+ messages in thread

From: Glauber Costa @ 2015-11-28  2:43 UTC (permalink / raw)
To: xfs, Avi Kivity, david

[-- Attachment #1: Type: text/plain, Size: 3130 bytes --]

Hello my dear XFSers,

For those of you who don't know, we at ScyllaDB produce a modern NoSQL data store that, at the moment, runs on top of XFS only. We deal exclusively with asynchronous and direct IO, due to our thread-per-core architecture, and because of that we avoid issuing any operation that will sleep.

While debugging an extreme case of bad performance (most likely related to a not-so-great disk), I found a variety of cases in which XFS blocks. To find them, I used perf record -e sched:sched_switch -p <pid_of_db>, and I am attaching the perf report as xfs-sched_switch.log. Please note that this doesn't tell me how long we block for, but as mentioned above, blocking operations outside our control are detrimental to us regardless of the elapsed time.

For those who are not acquainted with our internals, please ignore everything in that file except the xfs functions. For the xfs symbols, there are two kinds of events: the ones that are children of io_submit, where we don't tolerate blocking, and the ones that are children of our helper IO thread, to which we push big operations that we know will block until we can get rid of them all. We care about the former and ignore the latter.

Please allow me to ask you a couple of questions about those findings. If we are doing anything wrong, advice on best practices is truly welcome.

1) xfs_buf_lock -> xfs_log_force

I've started wondering what would make xfs_log_force sleep. But then I noticed that xfs_log_force will only be called when a buffer is marked stale, and most of the time a buffer seems to be marked stale due to errors.
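(Methodology aside: the tracing recipe above can be sketched roughly as below. This is a sketch only; it assumes perf is installed, tracing is permitted, the server binary is named scylla, and -g is added so the report shows the call chains leading to each sleep.)

```shell
# Sketch of the tracing recipe (assumptions: perf installed, tracing
# permitted, server binary named 'scylla'; adjust -p as needed).
pid=$(pidof scylla 2>/dev/null | awk '{print $1}')
if command -v perf >/dev/null 2>&1 && [ -n "$pid" ]; then
    # sched:sched_switch fires every time the task is scheduled out, so
    # any sample whose call chain passes through io_submit is a
    # submission-path sleep; samples under the helper IO thread are
    # expected blocking and can be ignored.
    perf record -e sched:sched_switch -g -p "$pid" -- sleep 60
    perf report --stdio > xfs-sched_switch.log
else
    echo "perf or target process not available; nothing to trace"
fi
```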
Although that is not my case (more on that), it got me thinking that maybe the right thing to do would be to avoid hitting this case altogether. The file example-stale.txt contains a backtrace of the case where we are being marked as stale. It seems to be happening when we convert the inode's extents from unwritten to real. Can this case be avoided? I won't pretend I know the intricacies of this, but couldn't we keep the extents real from the very beginning to avoid creating stale buffers?

2) xfs_buf_lock -> down

This is one I truly don't understand. What can be causing contention on this lock? We never have two different cores writing to the same buffer, nor should we have the same core doing so.

3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time

You guys seem to have an interface to avoid that, by setting the FMODE_NOCMTIME flag. This is done by issuing the open-by-handle ioctl, which will set this flag for all regular files. That's great, but that ioctl requires CAP_SYS_ADMIN, which is a big no for us, since we run our server as an unprivileged user. I don't understand, however, why such a strict check is needed. If we have full rights on the filesystem, why can't we issue this operation? In my view, CAP_FOWNER should already be enough. I do understand that the handles have to be stable and that a file can have its ownership changed, in which case the previous owner would keep the handle valid. Is that the reason you went with the most restrictive capability?

[-- Attachment #2: xfs-sched_switch.log --]
[-- Type: text/plain, Size: 171763 bytes --]

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 2K of event 'sched:sched_switch'
# Event count (approx.): 2669
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ..............
# 100.00% scylla [kernel.kallsyms] [k] __schedule | ---__schedule | |--96.18%-- schedule | | | |--56.14%-- schedule_user | | | | | |--53.30%-- int_careful | | | | | | | |--45.05%-- 0x7f4ade6f74ed | | | | reactor_backend_epoll::make_reactor_notifier | | | | | | | | | |--67.63%-- syscall_work_queue::submit_item | | | | | | | | | | | |--32.05%-- posix_file_impl::truncate | | | | | | | | | | | | | |--65.33%-- _ZN12continuationIZN6futureIJEE4thenIZN19file_data_sink_impl5flushEvEUlvE_S1_EET0_OT_EUlS7_E_JEE3runEv | | | | | | | reactor::del_timer | | | | | | | 0x60b0000e2040 | | | | | | | | | | | | | |--20.00%-- db::commitlog::segment::flush(unsigned long)::{lambda()#1}::operator() | | | | | | | | | | | | | | | |--73.33%-- future<>::then<db::commitlog::segment::flush(unsigned long)::{lambda()#1}, future<lw_shared_ptr<db::commitlog::segment> > > | | | | | | | | _ZN12continuationIZN6futureIJ13lw_shared_ptrIN2db9commitlog7segmentEEEE4thenIZNS4_4syncEvEUlT_E_S6_EET0_OS8_EUlSB_E_JS5_EE3runEv | | | | | | | | reactor::del_timer | | | | | | | | 0x60e0000e2040 | | | | | | | | | | | | | | | --26.67%-- _ZN12continuationIZN6futureIJEE4thenIZN2db9commitlog7segment5flushEmEUlvE_S0_IJ13lw_shared_ptrIS5_EEEEET0_OT_EUlSC_E_JEE3runEv | | | | | | | reactor::del_timer | | | | | | | 0x6090000e2040 | | | | | | | | | | | | | |--10.67%-- sstables::sstable::seal_sstable | | | | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> 
seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke | | | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev | | | | | | | | | | | | | --4.00%-- sstables::sstable::write_toc | | | | | | sstables::sstable::prepare_write_components | | | | | | | | | | | | | |--50.00%-- 0x4d3a4f6ec4e8cd75 | | | | | | | | | | | | | --50.00%-- 0x3ebf3dd80e3b174d | | | | | | | | | | | |--23.93%-- posix_file_impl::discard | | | | | | | | | | | | | |--82.14%-- _ZN12continuationIZN6futureIImEE4thenIZN19file_data_sink_impl6do_putEm16temporary_bufferIcEEUlmE_S0_IIEEEET0_OT_EUlSA_E_ImEE3runEv | | | | | | | reactor::del_timer | | | | | | | 0x6080000e2040 | | | | | | | | | | | | | --17.86%-- futurize<future<lw_shared_ptr<db::commitlog::segment> > >::apply<db::commitlog::segment_manager::allocate_segment(bool)::{lambda(file)#1}, file> | | | | | | _ZN12continuationIZN6futureIJ4fileEE4thenIZN2db9commitlog15segment_manager16allocate_segmentEbEUlS1_E_S0_IJ13lw_shared_ptrINS5_7segmentEEEEEET0_OT_EUlSE_E_JS1_EE3runEv | | | | | | | | | | | |--20.94%-- reactor::open_file_dma | | | | | | | | | | | | | |--20.41%-- db::commitlog::segment_manager::allocate_segment | | | | | | | db::commitlog::segment_manager::on_timer()::{lambda()#1}::operator() | | | | | | | 0xb8c264 | | | | | | | | | | | | | |--14.29%-- sstables::sstable::write_simple<(sstables::sstable::component_type)8, sstables::statistics> | | | | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type 
seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke | | | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev | | | | | | | | | | | | | |--12.24%-- sstables::write_crc | | | | | | | | | | | | | | | |--16.67%-- 0x313532343536002f | | | | | | | | | | | | | | | |--16.67%-- 0x373633323533002f | | | | | | | | | | | | | | | |--16.67%-- 0x363139333232002f | | | | | | | | | | | | | | | |--16.67%-- 0x353933303330002f | | | | | | | | | | | | | | | |--16.67%-- 0x383930383133002f | | | | | | | | | | | | | | | --16.67%-- 0x323338303037002f | | | | | | | | | | | | | |--12.24%-- sstables::write_digest | | | | | | | | | | | | | |--10.20%-- sstables::sstable::write_simple<(sstables::sstable::component_type)7, sstables::filter> | | | | | | | sstables::sstable::write_filter | | | | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned 
long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke | | | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev | | | | | | | | | | | | | |--10.20%-- sstables::sstable::write_simple<(sstables::sstable::component_type)4, sstables::summary_ka> | | | | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke | | | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev | | | | | | | | | | | | | |--10.20%-- 0x78d93b | | | | | | | | | | | | | |--6.12%-- sstables::sstable::open_data | | | | | | | | | | | | | | | --100.00%-- 0x8000000004000000 | | | | | | | | | | | | | --4.08%-- sstables::sstable::write_toc | | | | | | 
sstables::sstable::prepare_write_components | | | | | | | | | | | | | --100.00%-- 0x6100206690ef | | | | | | | | | | | |--18.38%-- syscall_work_queue::submit_item | | | | | | | | | | | | | |--10.00%-- 0x7f4ad89f8fe0 | | | | | | | | | | | | | |--7.50%-- 0x7f4ad83f8fe0 | | | | | | | | | | | | | |--7.50%-- 0x7f4ad6bf8fe0 | | | | | | | | | | | | | |--7.50%-- 0x7f4ad65f8fe0 | | | | | | | | | | | | | |--5.00%-- 0x60b015e8cd90 | | | | | | | | | | | | | |--5.00%-- 0x60100acaed90 | | | | | | | | | | | | | |--5.00%-- 0x607006f04d90 | | | | | | | | | | | | | |--5.00%-- 0xffffffffffffa5d0 | | | | | | | | | | | | | |--2.50%-- 0x60e01acbed90 | | | | | | | | | | | | | |--2.50%-- 0x60e01acbec60 | | | | | | | | | | | | | |--2.50%-- 0x60a018d7ad90 | | | | | | | | | | | | | |--2.50%-- 0x60a018d7ac60 | | | | | | | | | | | | | |--2.50%-- 0x60b015e8cc60 | | | | | | | | | | | | | |--2.50%-- 0x60900bb8ad60 | | | | | | | | | | | | | |--2.50%-- 0x60100acaec60 | | | | | | | | | | | | | |--2.50%-- 0x60800951dd90 | | | | | | | | | | | | | |--2.50%-- 0x60800951dc60 | | | | | | | | | | | | | |--2.50%-- 0x60d009089d90 | | | | | | | | | | | | | |--2.50%-- 0x60d009089c60 | | | | | | | | | | | | | |--2.50%-- 0x607006f04c60 | | | | | | | | | | | | | |--2.50%-- 0x60f005984d60 | | | | | | | | | | | | | |--2.50%-- 0x7f4ad77f8fe0 | | | | | | | | | | | | | |--2.50%-- 0x7f4adb9f8fe0 | | | | | | | | | | | | | |--2.50%-- 0x7f4ad9bf8fe0 | | | | | | | | | | | | | |--2.50%-- 0x7f4ad7df8fe0 | | | | | | | | | | | | | |--2.50%-- 0x7f4ad77f8fe0 | | | | | | | | | | | | | --2.50%-- 0x7f4ad5ff8fe0 | | | | | | | | | | | |--2.99%-- reactor::open_directory | | | | | | | | | | | | | |--57.14%-- sstables::sstable::filename | | | | | | | | | | | | | --42.86%-- sstables::sstable::write_toc | | | | | | sstables::sstable::prepare_write_components | | | | | | | | | | | | | |--50.00%-- 0x4d3a4f6ec4e8cd75 | | | | | | | | | | | | | --50.00%-- 0x3ebf3dd80e3b174d | | | | | | | | | | | --1.71%-- reactor::rename_file | | | | | 
sstables::sstable::seal_sstable | | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev | | | | | | | | | --32.37%-- _ZN12continuationIZN6futureIJEE4thenIZN18syscall_work_queue11submit_itemEPNS3_9work_itemEEUlvE_S1_EET0_OT_EUlS9_E_JEE3runEv | | | | reactor::del_timer | | | | 0x60d0000e2040 | | | | | | | |--29.04%-- __vdso_clock_gettime | | | | | | | |--19.66%-- 0x7f4ade42b193 | | | | reactor_backend_epoll::complete_epoll_event | | | | | | | | | |--41.61%-- smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(frozen_mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(frozen_mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process | | | | | | | | | | | |--79.03%-- smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}> | | | | | | | | | | | | | |--95.92%-- 0x6070000c3000 | | | | | | | | | | | | | 
|--2.04%-- 0x61d0000c1000 | | | | | | | | | | | | | --2.04%-- 0x61d0000c1000 | | | | | | | | | | | |--3.23%-- 0x14dd51 | | | | | | | | | | | |--1.61%-- 0x162a54 | | | | | | | | | | | |--1.61%-- 0x161dca | | | | | | | | | | | |--1.61%-- 0x159c8b | | | | | | | | | | | |--1.61%-- 0x1598b5 | | | | | | | | | | | |--1.61%-- 0x14dd3e | | | | | | | | | | | |--1.61%-- 0x14bad8 | | | | | | | | | | | |--1.61%-- 0x14a880 | | | | | | | | | | | |--1.61%-- 0x127105 | | | | | | | | | | | |--1.61%-- 0x6070000e2040 | | | | | | | | | | | |--1.61%-- smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}> | | | | | | 0x60d0000c3000 | | | | | | | | | | | --1.61%-- __vdso_clock_gettime | | | | | 0x7f4ad77f9160 | | | | | | | | | |--30.20%-- __restore_rt | | | | | | | | | | | |--57.14%-- __vdso_clock_gettime | | | | | | 0x1d | | | | | | | | | | | |--9.52%-- smp_message_queue::smp_message_queue | | | | | | 0x6070000c3000 | | | | | | | | | | | |--4.76%-- 0x600000357240 | | | | | | | | | | | |--4.76%-- 0x60000031a640 | | | | | | | | | | | |--2.38%-- posix_file_impl::list_directory | | | | | | 0x609000044730 | | | | | | | | | | | |--2.38%-- 0x46efbf | | | | | | | | | | | |--2.38%-- 0x600000442e40 | | | | | | | | | | | |--2.38%-- 0x600000376440 | | | | | | | | | | | |--2.38%-- 0x6000002bac40 | | | | | | | | | | | |--2.38%-- 0x600000295640 | | | | | | | | | | | |--2.38%-- 0x600000289e40 | | | | | | | | | | | |--2.38%-- 0x60000031a640 | | | | | | | | | | | |--2.38%-- 0x7f4ade6f74ed | | | | | | __libc_siglongjmp | | | | | | 0x60000047be40 | | | | | | | | | | | --2.38%-- 0x7f4adb3f7fd0 | | | | | | | | | |--14.09%-- 0x33 | | | | | | | | | |--12.08%-- promise<temporary_buffer<char> >::promise | | | | | _ZN6futureIJ16temporary_bufferIcEEE4thenIZN12input_streamIcE12read_exactlyEmEUlT_E_S2_EET0_OS6_ | | | | | | | | | | | |--44.44%-- input_stream<char>::read_exactly | | | | | | 0x8 | | | | | | | | | | | |--11.11%-- 0x7f4adb3f8ea0 | | | | | | 
| | | | | |--11.11%-- 0x7f4ad9bf8ea0 | | | | | | | | | | | |--11.11%-- 0x7f4ad89f8ea0 | | | | | | | | | | | |--11.11%-- 0x7f4ad83f8ea0 | | | | | | | | | | | |--5.56%-- 0x7f4ad77f8ea0 | | | | | | | | | | | --5.56%-- 0x7f4ad7df8ea0 | | | | | | | | | |--1.34%-- 0x7f4ad6bf8d80 | | | | | | | | | --0.67%-- 0x7f4adadf8d80 | | | | | | | |--4.43%-- __libc_send | | | | _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv | | | | | | | | | |--14.71%-- 0x4 | | | | | | | | | |--11.76%-- 0x7f4ad89f8de0 | | | | | | | | | |--8.82%-- 0x7f4adb3f8de0 | | | | | | | | | |--8.82%-- 0x7f4ad9bf8de0 | | | | | | | | | |--8.82%-- 0x7f4ad77f8de0 | | | | | | | | | |--8.82%-- 0x7f4ad6bf8de0 | | | | | | | | | |--5.88%-- 0x7f4ad83f8de0 | | | | | | | | | |--5.88%-- 0x7f4ad7df8de0 | | | | | | | | | |--5.88%-- 0x7f4ad53f8de0 | | | | | | | | | |--2.94%-- 0x7f4acc9f8de0 | | | | | | | | | |--2.94%-- continuation<future<file>::wait()::{lambda(future_state<file>&&)#1}, file>::~continuation | | | | | 0x611003c8e9b8 | | | | | | | | | |--2.94%-- 0x7f4adb9f8de0 | | | | | | | | | |--2.94%-- 0x7f4ad71f8de0 | | | | | | | | | |--2.94%-- 0x7f4ad65f8de0 | | | | | | | | | |--2.94%-- 0x7f4ad59f8de0 | | | | | | | | | --2.94%-- 0x7f4ad35f8de0 | | | | | | | |--1.56%-- 0x7f4ade6f754d | | | | reactor::read_some | | | | | | | | | |--66.67%-- _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv | | | | | reactor::del_timer | | | | | 0x6070000e2040 | | | | | | | | | |--8.33%-- _ZN12continuationIZN6futureIIEE4thenIZ5sleepINSt6chrono3_V212system_clockEmSt5ratioILl1ELl1000000EEES1_NS4_8durationIT0_T1_EEEUlvE_S1_EESA_OT_EUlSF_E_IEE3runEv | | | | | reactor::del_timer | | | | | 0x6080000e2040 | | | | | | | | | |--8.33%-- 0x600000483640 | | | | | | | | | |--8.33%-- 0x600000480440 | | | | | | | | | 
--8.33%-- 0x36 | | | --0.26%-- [...] | | | | | --46.70%-- retint_careful | | | | | |--6.24%-- posix_file_impl::list_directory | | | | | | | |--80.00%-- 0x60f0000e2020 | | | | | | | |--5.00%-- 0x601000044730 | | | | | | | |--5.00%-- 0x60e000044720 | | | | | | | |--2.50%-- 0x60f000135500 | | | | | | | |--2.50%-- 0x6190000e2098 | | | | | | | |--2.50%-- 0x60d0000c3000 | | | | | | | --2.50%-- 0x1 | | | | | |--3.42%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop | | | | | | | |--95.65%-- boost::program_options::variables_map::get | | | | | | | --4.35%-- 0x618000044680 | | | | | |--3.12%-- memory::small_pool::add_more_objects | | | | | | | |--10.53%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::clear_and_release | | | | mutation_partition::clustered_row | | | | mutation::set_clustered_cell | | | | cql3::constants::setter::execute | | | | cql3::statements::update_statement::add_update_for_key | | | | _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE | | | | cql3::statements::modification_statement::get_mutations | | | | cql3::statements::modification_statement::execute_without_condition | | | | cql3::query_options::query_options | | | | | | | | | |--50.00%-- 0x7f4ad77f80e0 | | | | | | | | | --50.00%-- 0x7f4ad6bf80e0 | | | | | | | |--10.53%-- memory::small_pool::add_more_objects | | | | | | | | | |--50.00%-- 0x60e00015d000 | | | | | | | | | --50.00%-- 0x60b00af6c758 | | | | | | | |--5.26%-- 0x60a018ee3867 | | | | | | | |--5.26%-- 0x60d00d41f680 | | | | | | | |--5.26%-- 0x61400c6bb4d0 | | | | | | | |--5.26%-- 0x60e007c918d6 | | | | | | | |--5.26%-- 0x60e0078294ce | | | | | | | |--5.26%-- 0x607006ee4da0 | | | | | | | |--5.26%-- 
_ZN12continuationIZN6futureIJEE12then_wrappedIZNS1_16handle_exceptionIZN7service13storage_proxy22send_to_live_endpointsEmEUlNSt15__exception_ptr13exception_ptrEE0_EES1_OT_EUlSA_E_S1_EET0_SA_EUlSA_E_JEE3runEv | | | | reactor::del_timer | | | | 0x6030000e2040 | | | | | | | |--5.26%-- service::storage_proxy::mutate_locally | | | | service::storage_proxy::send_to_live_endpoints | | | | parallel_for_each<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, service::storage_proxy::mutate_begin(std::vector<unsigned long, std::allocator<unsigned long> >, db::consistency_level)::{lambda(unsigned long)#1}> | | | | 0x601000136d00 | | | | | | | |--5.26%-- 0x60a0001900e0 | | | | | | | |--5.26%-- 0x60e00015d040 | | | | | | | |--5.26%-- 0x61300015d000 | | | | | | | |--5.26%-- 0x60e00013bde0 | | | | | | | |--5.26%-- 0x60b00010f308 | | | | | | | |--5.26%-- 0x6010000e4808 | | | | | | | --5.26%-- 0x7f4ad65f7f50 | | | | | |--2.82%-- std::unique_ptr<reactor::pollfn, std::default_delete<std::unique_ptr> > reactor::make_pollfn<reactor::run()::{lambda()#3}>(reactor::run()::{lambda()#3}&&)::the_pollfn::poll_and_check_more_work | | | | | | | |--25.00%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop | | | | boost::program_options::variables_map::get | | | | | | | |--25.00%-- 0x1 | | | | | | | |--12.50%-- 0x53 | | | | | | | |--12.50%-- 0x3e | | | | | | | |--12.50%-- 0x24 | | | | | | | --12.50%-- 0xb958000000000000 | | | | | |--2.67%-- std::_Function_handler<partition_presence_checker_result (partition_key const&), column_family::make_partition_presence_checker(lw_shared_ptr<std::map<long, lw_shared_ptr<sstables::sstable>, std::less<long>, std::allocator<std::pair<long const, lw_shared_ptr<sstables::sstable> > > > >)::{lambda(partition_key const&)#1}>::_M_invoke | | | | | | | |--66.67%-- 0x1b5c280 | | | | | | | |--27.78%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::resize | | | | row::apply | 
| | | mutation_partition_applier::accept_row_cell | | | | mutation_partition_view::accept | | | | | | | --5.56%-- 0x2a4399 | | | | | |--2.08%-- smp_message_queue::smp_message_queue | | | | | | | |--60.00%-- 0x60f0000c3000 | | | | | | | |--10.00%-- 0x6000002d7240 | | | | | | | |--10.00%-- 0x19 | | | | | | | |--10.00%-- 0xb | | | | | | | --10.00%-- 0x7 | | | | | |--1.93%-- smp_message_queue::process_queue<4ul, smp_message_queue::process_completions()::{lambda(smp_message_queue::work_item*)#1}> | | | | | |--1.63%-- __vdso_clock_gettime | | | | | | | --100.00%-- __clock_gettime | | | std::chrono::_V2::system_clock::now | | | 0xa63209 | | | | | |--1.49%-- memory::small_pool::deallocate | | | | | | | |--40.00%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::emplace_back<atomic_cell_or_collection> | | | | | | | |--20.00%-- std::_Hashtable<unsigned long, std::pair<unsigned long const, service::storage_proxy::rh_entry>, std::allocator<std::pair<unsigned long const, service::storage_proxy::rh_entry> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase | | | | service::storage_proxy::got_response | | | | _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv | | | | reactor::del_timer | | | | 0x6100000e2040 | | | | | | | |--10.00%-- cql3::statements::modification_statement::get_mutations | | | | | | | |--10.00%-- cql3::statements::modification_statement::build_partition_keys | | | | cql3::statements::modification_statement::create_exploded_clustering_prefix | | | | 0x60c014be0b00 | | | | | | | |--10.00%-- mutation_partition::~mutation_partition | | | | std::vector<mutation, std::allocator<mutation> 
>::~vector | | | | service::storage_proxy::mutate_with_triggers | | | | cql3::statements::modification_statement::execute_without_condition | | | | cql3::statements::modification_statement::execute | | | | cql3::query_processor::process_statement | | | | transport::cql_server::connection::process_execute | | | | transport::cql_server::connection::process_request_one | | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > | | | | 0x8961de | | | | | | | --10.00%-- object_deleter_impl<deleter>::~object_deleter_impl | | | _ZN12continuationIZN6futureIJEE12then_wrappedIZZNS1_7finallyIZ7do_withI11foreign_ptrI10shared_ptrIN9transport10cql_server8responseEEEZZNS8_10connection14write_responseEOSB_ENUlvE_clEvEUlRT_E_EDaOSF_OT0_EUlvE_EES1_SI_ENUlS1_E_clES1_EUlSF_E_S1_EESJ_SI_EUlSI_E_JEED0Ev | | | 0x61a0000c3db0 | | | | | |--1.34%-- dht::decorated_key::equal | | | | | | | |--83.33%-- 0x607000138f00 | | | | | | | 
--16.67%-- 0x60a0000e0f40 | | | | | |--1.34%-- service::storage_proxy::send_to_live_endpoints | | | | | |--1.19%-- transport::cql_server::connection::process_execute | | | transport::cql_server::connection::process_request_one | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > | | | | | | | |--87.50%-- transport::cql_server::connection::process_request | | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | | do_void_futurize_apply<void 
do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | | 0x60e0000c3000 | | | | | | | --12.50%-- 0x8961de | | | | | |--1.19%-- reactor::run | | | | | | | |--87.50%-- smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator() | | | | continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run | | | | 0x600000043d00 | | | | | | | --12.50%-- app_template::run_deprecated | | | main | | | __libc_start_main | | | _GLOBAL__sub_I__ZN3org6apache9cassandra21g_cassandra_constantsE | | | 0x7f4ae20c9fa0 | | | | | |--1.04%-- __clock_gettime | | | std::chrono::_V2::system_clock::now | | | | | | | |--42.86%-- reactor::run | | | | smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator() | | | | continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run | | | | 0x600000043d00 | | | | | | | |--14.29%-- 0xa63209 | | | | | | | |--14.29%-- continuation<future<> future<>::finally<auto 
do_with<std::vector<frozen_mutation, std::allocator<frozen_mutation> >, shared_ptr<service::storage_proxy>, service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> >)#1}::operator()(std::vector<frozen_mutation, std::allocator<frozen_mutation> >) const::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> > const&, shared_ptr<service::storage_proxy>&)#1}>(std::vector<frozen_mutation, std::allocator<frozen_mutation> >&&, shared_ptr<service::storage_proxy>&&, service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> >)#1}::operator()(std::vector<frozen_mutation, std::allocator<frozen_mutation> >) const::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> > const&, shared_ptr<service::storage_proxy>&)#1}&&)::{lambda()#1}>(service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::a | | | | 0x2b7434 | | | | | | | |--14.29%-- _ZN8futurizeI6futureIJSt10unique_ptrIN4cql317update_parametersESt14default_deleteIS3_EEEEE5applyIZNS2_10statements22modification_statement22make_update_parametersERN7seastar7shardedIN7service13storage_proxyEEE13lw_shared_ptrISt6vectorI13partition_keySaISK_EEESI_I26exploded_clustering_prefixERKNS2_13query_optionsEblEUlT_E_JNSt12experimental15fundamentals_v18optionalINS3_13prefetch_dataEEEEEES7_OST_OSt5tupleIJDpT0_EE | | | | cql3::statements::modification_statement::make_update_parameters | | | | cql3::statements::modification_statement::get_mutations | | | | cql3::statements::modification_statement::execute_without_condition | | | | cql3::query_options::query_options | | | | 0x7f4ad6bf80e0 | | | | | | | --14.29%-- database::apply_in_memory | | | database::do_apply | | | _ZN12continuationIZN6futureIJEE4thenIZN8database5applyERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSA_E_JEE3runEv | | | reactor::del_timer | | | 0x6090000e2040 | | | | | |--1.04%-- 
memory::small_pool::allocate | | | | | | | |--14.29%-- 0x5257c379469d9 | | | | | | | |--14.29%-- 0x609002b9fe98 | | | | | | | |--14.29%-- 0x13c8b90 | | | | | | | |--14.29%-- 0x60f000190710 | | | | | | | |--14.29%-- 0x25 | | | | | | | |--14.29%-- 0x7f4ad6bf84c0 | | | | | | | --14.29%-- 0x7f4ad53f81f0 | | | | | |--0.89%-- db::serializer<atomic_cell_view>::serializer | | | mutation_partition_serializer::write_without_framing | | | frozen_mutation::frozen_mutation | | | frozen_mutation::frozen_mutation | | | | | |--0.89%-- do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | 0x60f0000c3000 | | | | | |--0.89%-- futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, 
future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > | | | transport::cql_server::connection::process_request | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | | | | | |--83.33%-- 0x6090000c3000 | | | | | | | --16.67%-- 0x600000044400 | | | | | |--0.89%-- std::_Function_handler<void (), reactor::run()::{lambda()#8}>::_M_invoke | | | | | | | |--50.00%-- reactor::run | | | | smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator() | | | | continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run | | | | 0x600000043d00 | | | | | | | --50.00%-- reactor::signals::signal_handler::signal_handler | | | 0x3e8 | | | | | |--0.74%-- db::commitlog::segment::allocate | | | | | | | --100.00%-- db::commitlog::add | | | database::do_apply | | | | | | | 
|--75.00%-- database::apply | | | | smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process | | | | smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}> | | | | boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop | | | | boost::program_options::variables_map::get | | | | | | | --25.00%-- _ZN12continuationIZN6futureIJEE4thenIZN8database5applyERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSA_E_JEE3runEv | | | reactor::del_timer | | | 0x60b0000e2040 | | | | | |--0.74%-- service::storage_proxy::create_write_response_handler | | | | | |--0.74%-- transport::cql_server::connection::process_request_one | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, 
future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > | | | | | | | |--80.00%-- transport::cql_server::connection::process_request | | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | | 0x60a0000c3000 | | | | | | | --20.00%-- 0x8961de | | | | | |--0.74%-- compound_type<(allow_prefixes)0>::compare | | | | | | | |--20.00%-- 0x6030056c0f20 | | | | | | | |--20.00%-- boost::intrusive::bstbase2<boost::intrusive::mhtraits<rows_entry, boost::intrusive::set_member_hook<void, void, void, void>, &rows_entry::_link>, rows_entry::compare, (boost::intrusive::algo_types)5, boost::intrusive::detail::default_header_holder<boost::intrusive::rbtree_node_traits<void*, false> > >::find | | | | mutation_partition::clustered_row | | | | mutation::set_clustered_cell | | | | cql3::constants::setter::execute | | | | cql3::statements::update_statement::add_update_for_key | | | | _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE | | | | 
cql3::statements::modification_statement::get_mutations | | | | cql3::statements::modification_statement::execute_without_condition | | | | cql3::query_options::query_options | | | | 0x7f4adb3f80e0 | | | | | | | |--20.00%-- compound_type<(allow_prefixes)0>::compare | | | | | | | |--20.00%-- mutation_partition::clustered_row | | | | boost::intrusive::bstree_impl<boost::intrusive::mhtraits<rows_entry, boost::intrusive::set_member_hook<void, void, void, void>, &rows_entry::_link>, rows_entry::compare, unsigned long, true, (boost::intrusive::algo_types)5, boost::intrusive::detail::default_header_holder<boost::intrusive::rbtree_node_traits<void*, false> > >::insert_unique | | | | boost::intrusive::bstree_algorithms<boost::intrusive::rbtree_node_traits<void*, false> >::prev_node | | | | 0x12d | | | | | | | --20.00%-- 0x60f00052daf0 | | | | | |--0.74%-- __memmove_ssse3_back | | | | | | | |--40.00%-- output_stream<char>::write | | | | | | | | | |--50.00%-- transport::cql_server::response::output | | | | | futurize<future<> >::apply<transport::cql_server::connection::write_response(foreign_ptr<shared_ptr<transport::cql_server::response> >&&)::{lambda()#1}> | | | | | | | | | --50.00%-- 0x7c7fb2 | | | | 0x5257c37847fa0 | | | | | | | |--20.00%-- transport::cql_server::connection::read_short_bytes | | | | transport::cql_server::connection::process_query | | | | 0x7f4ada7f86f0 | | | | | | | |--20.00%-- transport::cql_server::response::output | | | | futurize<future<> >::apply<transport::cql_server::connection::write_response(foreign_ptr<shared_ptr<transport::cql_server::response> >&&)::{lambda()#1}> | | | | 0x2 | | | | | | | --20.00%-- smp_message_queue::flush_response_batch | | | boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop | | | boost::program_options::variables_map::get | | | | | |--0.74%-- syscall_work_queue::work_item_returning<syscall_result_extra<stat>, reactor::file_size(basic_sstring<char, unsigned int, 
15u>)::{lambda()#1}>::~work_item_returning | | | | | | | |--60.00%-- 0x6130000c3000 | | | | | | | |--20.00%-- 0x608001fe59a0 | | | | | | | --20.00%-- 0x16 | | | | | |--0.74%-- __memset_sse2 | | | | | | | |--40.00%-- std::_Hashtable<range<dht::token>, std::pair<range<dht::token> const, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >, std::allocator<std::pair<range<dht::token> const, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > > >, std::__detail::_Select1st, std::equal_to<range<dht::token> >, std::hash<range<dht::token> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable | | | | locator::token_metadata::pending_endpoints_for | | | | service::storage_proxy::create_write_response_handler | | | | service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}> | | | | service::storage_proxy::mutate | | | | service::storage_proxy::mutate_with_triggers | | | | cql3::statements::modification_statement::execute_without_condition | | | | cql3::statements::modification_statement::execute | | | | cql3::query_processor::process_statement | | | | transport::cql_server::connection::process_execute | | | | transport::cql_server::connection::process_request_one | | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> 
>&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > | | | | transport::cql_server::connection::process_request | | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | | 0x6020000c3000 | | | | | | | |--40.00%-- service::digest_read_resolver::~digest_read_resolver | | | | | | | | | --100.00%-- 0x610002612b50 | | | | | | | --20.00%-- std::_Hashtable<basic_sstring<char, unsigned int, 15u>, 
std::pair<basic_sstring<char, unsigned int, 15u> const, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, std::allocator<std::pair<basic_sstring<char, unsigned int, 15u> const, std::vector<gms::inet_address, std::allocator<gms::inet_address> > > >, std::__detail::_Select1st, std::equal_to<basic_sstring<char, unsigned int, 15u> >, std::hash<basic_sstring<char, unsigned int, 15u> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable | | | service::storage_proxy::send_to_live_endpoints | | | parallel_for_each<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, service::storage_proxy::mutate_begin(std::vector<unsigned long, std::allocator<unsigned long> >, db::consistency_level)::{lambda(unsigned long)#1}> | | | service::storage_proxy::mutate | | | service::storage_proxy::mutate_with_triggers | | | cql3::statements::modification_statement::execute_without_condition | | | cql3::statements::modification_statement::execute | | | cql3::query_processor::process_statement | | | transport::cql_server::connection::process_execute | | | transport::cql_server::connection::process_request_one | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> 
>&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > | | | transport::cql_server::connection::process_request | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | 0x6070000c3000 | | | | | |--0.74%-- reactor::del_timer | | | | | | | |--80.00%-- 0x60a0000e2040 | | | | | | | --20.00%-- 0x6080000c3db0 | | | | | |--0.59%-- unimplemented::operator<< | | | | | | | |--25.00%-- _ZN12continuationIZN6futureIJ10shared_ptrIN9transport8messages14result_messageEEEE4thenIZN4cql315query_processor17process_statementES1_INS8_13cql_statementEERN7service11query_stateERKNS8_13query_optionsEEUlT_E_S6_EET0_OSI_EUlSL_E_JS5_EED2Ev | | | | 0x600100000008 | | | | | | | |--25.00%-- floating_type_impl<float>::from_string | | | | | | | |--25.00%-- 0x60e0000e4c10 | | | | | | | --25.00%-- 
_ZN12continuationIZN6futureIJ10shared_ptrIN9transport8messages14result_messageEEEE4thenIZN4cql315query_processor17process_statementES1_INS8_13cql_statementEERN7service11query_stateERKNS8_13query_optionsEEUlT_E_S6_EET0_OSI_EUlSL_E_JS5_EED2Ev | | | 0x600100000008 | | | | | |--0.59%-- std::_Hashtable<unsigned long, std::pair<unsigned long const, service::storage_proxy::rh_entry>, std::allocator<std::pair<unsigned long const, service::storage_proxy::rh_entry> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node | | | service::storage_proxy::register_response_handler | | | service::storage_proxy::create_write_response_handler | | | service::storage_proxy::create_write_response_handler | | | service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}> | | | service::storage_proxy::mutate | | | service::storage_proxy::mutate_with_triggers | | | cql3::statements::modification_statement::execute_without_condition | | | cql3::statements::modification_statement::execute | | | cql3::query_processor::process_statement | | | transport::cql_server::connection::process_execute | | | transport::cql_server::connection::process_request_one | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) 
const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > | | | transport::cql_server::connection::process_request | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > | | | 0x60b0000c3000 | | | | | |--0.59%-- mutation::set_clustered_cell | | | | | | | |--75.00%-- 0xa | | | | | | | --25.00%-- cql3::constants::setter::execute | | | cql3::statements::update_statement::add_update_for_key | | | 
_ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE | | | cql3::statements::modification_statement::get_mutations | | | cql3::statements::modification_statement::execute_without_condition | | | cql3::query_options::query_options | | | 0x7f4ad89f80e0 | | | | | |--0.59%-- memory::small_pool::small_pool | | | | | | | |--25.00%-- memory::stats | | | | boost::program_options::variables_map::get | | | | | | | |--25.00%-- memory::reclaimer::~reclaimer | | | | 0x1e | | | | | | | |--25.00%-- memory::allocate_aligned | | | | | | | --25.00%-- memory::small_pool::add_more_objects | | | memory::small_pool::add_more_objects | | | 0x6100000e0310 | | | | | |--0.59%-- __memcpy_sse2_unaligned | | | | | | | |--50.00%-- mutation_partition_applier::accept_row_cell | | | | mutation_partition_view::accept | | | | boost::intrusive::bstree_algorithms<boost::intrusive::rbtree_node_traits<void*, false> >::prev_node | | | | 0x12d | | | | | | | |--25.00%-- scanning_reader::operator() | | | | sstables::sstable::do_write_components | | | | sstables::sstable::prepare_write_components | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type 
()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev | | | | | | | --25.00%-- memtable::find_or_create_partition_slow | | | memtable::apply | | | database::apply_in_memory | | | database::do_apply | | | database::apply | | | smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process | | | smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}> | | | boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop | | | boost::program_options::variables_map::get | | | | | |--0.59%-- smp_message_queue::flush_response_batch | | | | | | | |--25.00%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop | | | | boost::program_options::variables_map::get | | | | | | | |--25.00%-- 0x13 | | | | | | | |--25.00%-- 0x7f4ad5ff8f40 | | | | | | | --25.00%-- reactor::run | | | smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator() | | | continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run | | | 0x600000043d00 | | --54.38%-- [...] 
            |--14.26%-- schedule_timeout
            |  |--38.52%-- wait_for_completion
            |  |  |--90.07%-- flush_work
            |  |  |  xlog_cil_force_lsn
            |  |  |  |--96.85%-- _xfs_log_force_lsn
            |  |  |  |  |--79.67%-- xfs_file_fsync
            |  |  |  |  |  vfs_fsync_range
            |  |  |  |  |  do_fsync
            |  |  |  |  |  sys_fdatasync
            |  |  |  |  |  entry_SYSCALL_64_fastpath
            |  |  |  |  |  --100.00%-- 0x7f4ade4212ad
            |  |  |  |  |     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
            |  |  |  |  |     0x6030000c3ec0
            |  |  |  |  --20.33%-- xfs_dir_fsync
            |  |  |  |     vfs_fsync_range
            |  |  |  |     do_fsync
            |  |  |  |     sys_fdatasync
            |  |  |  |     entry_SYSCALL_64_fastpath
            |  |  |  |     --100.00%-- 0x7f4ade4212ad
            |  |  |  |        syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
            |  |  |  |        0x6040000c3ec0
            |  |  |  --3.15%-- _xfs_log_force
            |  |  |     xfs_log_force
            |  |  |     xfs_buf_lock
            |  |  |     _xfs_buf_find
            |  |  |     xfs_buf_get_map
            |  |  |     xfs_trans_get_buf_map
            |  |  |     xfs_btree_get_bufl
            |  |  |     xfs_bmap_extents_to_btree
            |  |  |     xfs_bmap_add_extent_hole_real
            |  |  |     xfs_bmapi_write
            |  |  |     xfs_iomap_write_direct
            |  |  |     __xfs_get_blocks
            |  |  |     xfs_get_blocks_direct
            |  |  |     do_blockdev_direct_IO
            |  |  |     __blockdev_direct_IO
            |  |  |     xfs_vm_direct_IO
            |  |  |     xfs_file_dio_aio_write
            |  |  |     xfs_file_write_iter
            |  |  |     aio_run_iocb
            |  |  |     do_io_submit
            |  |  |     sys_io_submit
            |  |  |     entry_SYSCALL_64_fastpath
            |  |  |     io_submit
            |  |  |     0x46d98a
            |  |  --9.93%-- submit_bio_wait
            |  |     blkdev_issue_flush
            |  |     xfs_blkdev_issue_flush
            |  |     xfs_file_fsync
            |  |     vfs_fsync_range
            |  |     do_fsync
            |  |     sys_fdatasync
            |  |     entry_SYSCALL_64_fastpath
            |  |     --100.00%-- 0x7f4ade4212ad
            |  |        syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
            |  |        0x6030000c3ec0
            |  |--32.79%-- io_schedule_timeout
            |  |  bit_wait_io
            |  |  __wait_on_bit
            |  |  |--51.67%-- wait_on_page_bit
            |  |  |  |--95.16%-- filemap_fdatawait_range
            |  |  |  |  filemap_write_and_wait_range
            |  |  |  |  xfs_file_fsync
            |  |  |  |  vfs_fsync_range
            |  |  |  |  do_fsync
            |  |  |  |  sys_fdatasync
            |  |  |  |  entry_SYSCALL_64_fastpath
            |  |  |  |  0x7f4ade4212ad
            |  |  |  |  syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
            |  |  |  |  0x60b0000c3ec0
            |  |  |  --4.84%-- __migration_entry_wait
            |  |  |     migration_entry_wait
            |  |  |     handle_mm_fault
            |  |  |     __do_page_fault
            |  |  |     do_page_fault
            |  |  |     page_fault
            |  |  |     std::_Function_handler<void (), httpd::http_server::_date_format_timer::{lambda()#1}>::_M_invoke
            |  |  |     --100.00%-- service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}>
            |  |  |        service::storage_proxy::mutate
            |  |  |        service::storage_proxy::mutate_with_triggers
            |  |  |        cql3::statements::modification_statement::execute_without_condition
            |  |  |        cql3::statements::modification_statement::execute
            |  |  |        cql3::query_processor::process_statement
            |  |  |        transport::cql_server::connection::process_execute
            |  |  |        transport::cql_server::connection::process_request_one
            |  |  |        futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&>
            |  |  |        futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer>
            |  |  |        futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > >
            |  |  |        transport::cql_server::connection::process_request
            |  |  |        do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>
            |  |  |        do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
            |  |  |        do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> >
            |  |  |        0x6140000c3000
            |  |  --48.33%-- out_of_line_wait_on_bit
            |  |     block_truncate_page
            |  |     xfs_setattr_size
            |  |     xfs_vn_setattr
            |  |     notify_change
            |  |     do_truncate
            |  |     do_sys_ftruncate.constprop.15
            |  |     sys_ftruncate
            |  |     entry_SYSCALL_64_fastpath
            |  |     __GI___ftruncate64
            |  |     syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process
            |  |     |--13.79%-- 0x7f4ad29ff700
            |  |     |--13.79%-- 0x7f4acdbff700
            |  |     |--12.07%-- 0x7f4ad05ff700
            |  |     |--12.07%-- 0x7f4acedff700
            |  |     |--10.34%-- 0x7f4ad0bff700
            |  |     |--6.90%-- 0x7f4ad2fff700
            |  |     |--6.90%-- 0x7f4ad11ff700
            |  |     |--6.90%-- 0x7f4acf9ff700
            |  |     |--6.90%-- 0x7f4acf3ff700
            |  |     |--6.90%-- 0x7f4ace7ff700
            |  |     |--1.72%-- 0x7f4ad17ff700
            |  |     --1.72%-- 0x7f4aca5ff700
            |  --28.69%-- __down
            |     down
            |     xfs_buf_lock
            |     _xfs_buf_find
            |     xfs_buf_get_map
            |     |--97.14%-- xfs_buf_read_map
            |     |  xfs_trans_read_buf_map
            |     |  |--98.04%-- xfs_read_agf
            |     |  |  xfs_alloc_read_agf
            |     |  |  xfs_alloc_fix_freelist
            |     |  |  |--93.00%-- xfs_free_extent
            |     |  |  |  xfs_bmap_finish
            |     |  |  |  xfs_itruncate_extents
            |     |  |  |  |--87.10%-- xfs_inactive_truncate
            |     |  |  |  |  xfs_inactive
            |     |  |  |  |  xfs_fs_evict_inode
            |     |  |  |  |  evict
            |     |  |  |  |  iput
            |     |  |  |  |  __dentry_kill
            |     |  |  |  |  dput
            |     |  |  |  |  __fput
            |     |  |  |  |  ____fput
            |     |  |  |  |  task_work_run
            |     |  |  |  |  do_notify_resume
            |     |  |  |  |  int_signal
            |     |  |  |  |  __libc_close
            |     |  |  |  |  std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access
            |     |  |  |  --12.90%-- xfs_setattr_size
            |     |  |  |     xfs_vn_setattr
            |     |  |  |     notify_change
            |     |  |  |     do_truncate
            |     |  |  |     do_sys_ftruncate.constprop.15
            |     |  |  |     sys_ftruncate
            |     |  |  |     entry_SYSCALL_64_fastpath
            |     |  |  |     --100.00%-- __GI___ftruncate64
            |     |  |  |        syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process
            |     |  |  |        |--20.00%-- 0x7f4ad0bff700
            |     |  |  |        |--20.00%-- 0x7f4acedff700
            |     |  |  |        |--10.00%-- 0x7f4ad2fff700
            |     |  |  |        |--10.00%-- 0x7f4ad17ff700
            |     |  |  |        |--10.00%-- 0x7f4ad11ff700
            |     |  |  |        |--10.00%-- 0x7f4ad05ff700
            |     |  |  |        |--10.00%-- 0x7f4acf3ff700
            |     |  |  |        --10.00%-- 0x7f4acdbff700
            |     |  |  --7.00%-- xfs_alloc_vextent
            |     |  |     xfs_bmap_btalloc
            |     |  |     xfs_bmap_alloc
            |     |  |     xfs_bmapi_write
            |     |  |     xfs_iomap_write_direct
            |     |  |     __xfs_get_blocks
            |     |  |     xfs_get_blocks_direct
            |     |  |     do_blockdev_direct_IO
            |     |  |     __blockdev_direct_IO
            |     |  |     xfs_vm_direct_IO
            |     |  |     xfs_file_dio_aio_write
            |     |  |     xfs_file_write_iter
            |     |  |     aio_run_iocb
            |     |  |     do_io_submit
            |     |  |     sys_io_submit
            |     |  |     entry_SYSCALL_64_fastpath
            |     |  |     io_submit
            |     |  |     0x46d98a
            |     |  --1.96%-- xfs_read_agi
            |     |     xfs_iunlink_remove
            |     |     xfs_ifree
            |     |     xfs_inactive_ifree
            |     |     xfs_inactive
            |     |     xfs_fs_evict_inode
            |     |     evict
            |     |     iput
            |     |     __dentry_kill
            |     |     dput
            |     |     __fput
            |     |     ____fput
            |     |     task_work_run
            |     |     do_notify_resume
            |     |     int_signal
            |     |     __libc_close
            |     |     std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access
            |     --2.86%-- xfs_trans_get_buf_map
            |        xfs_btree_get_bufl
            |        xfs_bmap_extents_to_btree
            |        xfs_bmap_add_extent_hole_real
            |        xfs_bmapi_write
            |        xfs_iomap_write_direct
            |        __xfs_get_blocks
            |        xfs_get_blocks_direct
            |        do_blockdev_direct_IO
            |        __blockdev_direct_IO
            |        xfs_vm_direct_IO
            |        xfs_file_dio_aio_write
            |        xfs_file_write_iter
            |        aio_run_iocb
            |        do_io_submit
            |        sys_io_submit
            |        entry_SYSCALL_64_fastpath
            |        io_submit
            |        0x46d98a
            |--13.48%-- eventfd_ctx_read
            |  eventfd_read
            |  __vfs_read
            |  vfs_read
            |  sys_read
            |  entry_SYSCALL_64_fastpath
            |  0x7f4ade6f754d
            |  smp_message_queue::respond
            |  0xffffffffffffffff
            |--7.83%-- md_flush_request
            |  raid0_make_request
            |  md_make_request
            |  generic_make_request
            |  submit_bio
            |  |--92.54%-- submit_bio_wait
            |  |  blkdev_issue_flush
            |  |  xfs_blkdev_issue_flush
            |  |  xfs_file_fsync
            |  |  vfs_fsync_range
            |  |  do_fsync
            |  |  sys_fdatasync
            |  |  entry_SYSCALL_64_fastpath
            |  |  --100.00%-- 0x7f4ade4212ad
            |  |     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
            |  |     0x6010000c3ec0
            |  --7.46%-- _xfs_buf_ioapply
            |     xfs_buf_submit
            |     xlog_bdstrat
            |     xlog_sync
            |     xlog_state_release_iclog
            |     |--73.33%-- _xfs_log_force_lsn
            |     |  xfs_file_fsync
            |     |  vfs_fsync_range
            |     |  do_fsync
            |     |  sys_fdatasync
            |     |  entry_SYSCALL_64_fastpath
            |     |  --100.00%-- 0x7f4ade4212ad
            |     |     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
            |     |     0x6080000c3ec0
            |     --26.67%-- _xfs_log_force
            |        xfs_log_force
            |        xfs_buf_lock
            |        _xfs_buf_find
            |        xfs_buf_get_map
            |        xfs_trans_get_buf_map
            |        xfs_btree_get_bufl
            |        xfs_bmap_extents_to_btree
            |        xfs_bmap_add_extent_hole_real
            |        xfs_bmapi_write
            |        xfs_iomap_write_direct
            |        __xfs_get_blocks
            |        xfs_get_blocks_direct
            |        do_blockdev_direct_IO
            |        __blockdev_direct_IO
            |        xfs_vm_direct_IO
            |        xfs_file_dio_aio_write
            |        xfs_file_write_iter
            |        aio_run_iocb
            |        do_io_submit
            |        sys_io_submit
            |        entry_SYSCALL_64_fastpath
            |        io_submit
            |        0x46d98a
            |--5.53%-- _xfs_log_force_lsn
            |  |--80.28%-- xfs_file_fsync
            |  |  vfs_fsync_range
            |  |  do_fsync
            |  |  sys_fdatasync
            |  |  entry_SYSCALL_64_fastpath
            |  |  --100.00%-- 0x7f4ade4212ad
            |  |     syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
            |  |     |--97.92%-- 0x60d0000c3ec0
            |  |     |--1.04%-- 0x6020000c3ec0
            |  |     --1.04%-- 0x600000557ec0
            |  --19.72%-- xfs_dir_fsync
            |     vfs_fsync_range
            |     do_fsync
            |     sys_fdatasync
            |     entry_SYSCALL_64_fastpath
            |     --100.00%-- 0x7f4ade4212ad
            |        syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process
            |        0x6040000c3ec0
            |--1.25%-- rwsem_down_read_failed
            |  call_rwsem_down_read_failed
            |  |--90.62%-- xfs_ilock
            |  |  |--86.21%-- xfs_ilock_data_map_shared
            |  |  |  __xfs_get_blocks
            |  |  |  xfs_get_blocks_direct
            |  |  |  do_blockdev_direct_IO
__blockdev_direct_IO | | | | xfs_vm_direct_IO | | | | xfs_file_dio_aio_write | | | | xfs_file_write_iter | | | | aio_run_iocb | | | | do_io_submit | | | | sys_io_submit | | | | entry_SYSCALL_64_fastpath | | | | | | | | | --100.00%-- io_submit | | | | 0x46d98a | | | | | | | |--6.90%-- xfs_file_fsync | | | | vfs_fsync_range | | | | do_fsync | | | | sys_fdatasync | | | | entry_SYSCALL_64_fastpath | | | | 0x7f4ade4212ad | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process | | | | 0x6090000c3ec0 | | | | | | | --6.90%-- xfs_dir_fsync | | | vfs_fsync_range | | | do_fsync | | | sys_fdatasync | | | entry_SYSCALL_64_fastpath | | | 0x7f4ade4212ad | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process | | | 0x6070000c3ec0 | | | | | --9.38%-- xfs_log_commit_cil | | __xfs_trans_commit | | xfs_trans_commit | | | | | |--33.33%-- xfs_setattr_size | | | xfs_vn_setattr | | | notify_change | | | do_truncate | | | do_sys_ftruncate.constprop.15 | | | sys_ftruncate | | | entry_SYSCALL_64_fastpath | | | __GI___ftruncate64 | | | syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process | | | 0x7f4acedff700 | | | | | |--33.33%-- xfs_vn_update_time | | | file_update_time | | | xfs_file_aio_write_checks | | | xfs_file_dio_aio_write | | | xfs_file_write_iter | | | aio_run_iocb | | | do_io_submit | | | sys_io_submit | | | entry_SYSCALL_64_fastpath | | | io_submit | | | 0x46d98a | | | | | --33.33%-- xfs_bmap_add_attrfork | | xfs_attr_set | | xfs_initxattrs | | security_inode_init_security | | xfs_init_security | | xfs_generic_create | | xfs_vn_mknod | | xfs_vn_create | | vfs_create | | path_openat | | do_filp_open | | do_sys_open | | sys_open | | entry_SYSCALL_64_fastpath | | 0x7f4ade6f7cdd | | syscall_work_queue::work_item_returning<syscall_result<int>, reactor::open_file_dma(basic_sstring<char, unsigned int, 
15u>, open_flags, file_open_options)::{lambda()#1}>::process | | 0xffffffffffffffff | | | |--0.97%-- rwsem_down_write_failed | | call_rwsem_down_write_failed | | xfs_ilock | | xfs_vn_update_time | | file_update_time | | xfs_file_aio_write_checks | | xfs_file_dio_aio_write | | xfs_file_write_iter | | aio_run_iocb | | do_io_submit | | sys_io_submit | | entry_SYSCALL_64_fastpath | | io_submit | | 0x46d98a | | | |--0.51%-- xlog_cil_force_lsn | | | | | |--92.31%-- _xfs_log_force_lsn | | | | | | | |--91.67%-- xfs_file_fsync | | | | vfs_fsync_range | | | | do_fsync | | | | sys_fdatasync | | | | entry_SYSCALL_64_fastpath | | | | 0x7f4ade4212ad | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process | | | | 0x60b0000c3ec0 | | | | | | | --8.33%-- xfs_dir_fsync | | | vfs_fsync_range | | | do_fsync | | | sys_fdatasync | | | entry_SYSCALL_64_fastpath | | | 0x7f4ade4212ad | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process | | | 0x60d0000c3ec0 | | | | | --7.69%-- _xfs_log_force | | xfs_log_force | | xfs_buf_lock | | _xfs_buf_find | | xfs_buf_get_map | | xfs_trans_get_buf_map | | xfs_btree_get_bufl | | xfs_bmap_extents_to_btree | | xfs_bmap_add_extent_hole_real | | xfs_bmapi_write | | xfs_iomap_write_direct | | __xfs_get_blocks | | xfs_get_blocks_direct | | do_blockdev_direct_IO | | __blockdev_direct_IO | | xfs_vm_direct_IO | | xfs_file_dio_aio_write | | xfs_file_write_iter | | aio_run_iocb | | do_io_submit | | sys_io_submit | | entry_SYSCALL_64_fastpath | | io_submit | | 0x46d98a | --0.04%-- [...] 
| --3.82%-- preempt_schedule_common | |--99.02%-- _cond_resched | | | |--41.58%-- wait_for_completion | | | | | |--66.67%-- flush_work | | | xlog_cil_force_lsn | | | | | | | |--96.43%-- _xfs_log_force_lsn | | | | | | | | | |--77.78%-- xfs_file_fsync | | | | | vfs_fsync_range | | | | | do_fsync | | | | | sys_fdatasync | | | | | entry_SYSCALL_64_fastpath | | | | | 0x7f4ade4212ad | | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process | | | | | 0x6030000c3ec0 | | | | | | | | | --22.22%-- xfs_dir_fsync | | | | vfs_fsync_range | | | | do_fsync | | | | sys_fdatasync | | | | entry_SYSCALL_64_fastpath | | | | | | | | | --100.00%-- 0x7f4ade4212ad | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process | | | | 0x6030000c3ec0 | | | | | | | --3.57%-- _xfs_log_force | | | xfs_log_force | | | xfs_buf_lock | | | _xfs_buf_find | | | xfs_buf_get_map | | | xfs_trans_get_buf_map | | | xfs_btree_get_bufl | | | xfs_bmap_extents_to_btree | | | xfs_bmap_add_extent_hole_real | | | xfs_bmapi_write | | | xfs_iomap_write_direct | | | __xfs_get_blocks | | | xfs_get_blocks_direct | | | do_blockdev_direct_IO | | | __blockdev_direct_IO | | | xfs_vm_direct_IO | | | xfs_file_dio_aio_write | | | xfs_file_write_iter | | | aio_run_iocb | | | do_io_submit | | | sys_io_submit | | | entry_SYSCALL_64_fastpath | | | io_submit | | | 0x46d98a | | | | | --33.33%-- submit_bio_wait | | blkdev_issue_flush | | xfs_blkdev_issue_flush | | xfs_file_fsync | | vfs_fsync_range | | do_fsync | | sys_fdatasync | | entry_SYSCALL_64_fastpath | | | | | --100.00%-- 0x7f4ade4212ad | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process | | 0x6030000c3ec0 | | | |--33.66%-- flush_work | | xlog_cil_force_lsn | | | | | |--97.06%-- _xfs_log_force_lsn | | | | | | | |--78.79%-- xfs_file_fsync | | | | vfs_fsync_range | | | | do_fsync | | | | 
sys_fdatasync | | | | entry_SYSCALL_64_fastpath | | | | 0x7f4ade4212ad | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process | | | | 0x6030000c3ec0 | | | | | | | --21.21%-- xfs_dir_fsync | | | vfs_fsync_range | | | do_fsync | | | sys_fdatasync | | | entry_SYSCALL_64_fastpath | | | | | | | --100.00%-- 0x7f4ade4212ad | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process | | | 0x6030000c3ec0 | | | | | --2.94%-- _xfs_log_force | | xfs_log_force | | xfs_buf_lock | | _xfs_buf_find | | xfs_buf_get_map | | xfs_trans_get_buf_map | | xfs_btree_get_bufl | | xfs_bmap_extents_to_btree | | xfs_bmap_add_extent_hole_real | | xfs_bmapi_write | | xfs_iomap_write_direct | | __xfs_get_blocks | | xfs_get_blocks_direct | | do_blockdev_direct_IO | | __blockdev_direct_IO | | xfs_vm_direct_IO | | xfs_file_dio_aio_write | | xfs_file_write_iter | | aio_run_iocb | | do_io_submit | | sys_io_submit | | entry_SYSCALL_64_fastpath | | io_submit | | 0x46d98a | | | |--13.86%-- lock_sock_nested | | | | | |--78.57%-- tcp_sendmsg | | | inet_sendmsg | | | sock_sendmsg | | | SYSC_sendto | | | sys_sendto | | | entry_SYSCALL_64_fastpath | | | __libc_send | | | _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv | | | | | | | |--36.36%-- 0x7f4ad6bf8de0 | | | | | | | |--9.09%-- 0x4 | | | | | | | |--9.09%-- 0x7f4adadf8de0 | | | | | | | |--9.09%-- 0x7f4ada1f8de0 | | | | | | | |--9.09%-- 0x7f4ad89f8de0 | | | | | | | |--9.09%-- 0x7f4ad83f8de0 | | | | | | | |--9.09%-- 0x7f4ad4df8de0 | | | | | | | --9.09%-- 0x7f4ad35f8de0 | | | | | --21.43%-- tcp_recvmsg | | inet_recvmsg | | sock_recvmsg | | sock_read_iter | | __vfs_read | | vfs_read | | sys_read | | entry_SYSCALL_64_fastpath | | 0x7f4ade6f754d | | reactor::read_some | | | | | |--66.67%-- 
_ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv | | | reactor::del_timer | | | 0x6160000e2040 | | | | | --33.33%-- continuation<future<> future<>::then_wrapped<future<> future<>::finally<auto seastar::with_gate<transport::cql_server::connection::process()::{lambda()#2}::operator()() const::{lambda()#1}>(seastar::gate&, transport::cql_server::connection::process()::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>(seastar::gate&)::{lambda(future<>)#1}::operator()(future<>)::{lambda(seastar::gate)#1}, future<> >(seastar::gate&)::{lambda(seastar::gate&)#1}>::run | | reactor::del_timer | | 0x6030000e2040 | | | |--3.96%-- generic_make_request_checks | | generic_make_request | | submit_bio | | do_blockdev_direct_IO | | __blockdev_direct_IO | | xfs_vm_direct_IO | | xfs_file_dio_aio_write | | xfs_file_write_iter | | aio_run_iocb | | do_io_submit | | sys_io_submit | | entry_SYSCALL_64_fastpath | | io_submit | | 0x46d98a | | | |--3.96%-- kmem_cache_alloc_node | | __alloc_skb | | sk_stream_alloc_skb | | tcp_sendmsg | | inet_sendmsg | | sock_sendmsg | | SYSC_sendto | | sys_sendto | | entry_SYSCALL_64_fastpath | | __libc_send | | _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv | | | | | |--25.00%-- 0x7f4ad9bf8de0 | | | | | |--25.00%-- 0x7f4ad7df8de0 | | | | | |--25.00%-- 0x7f4ad77f8de0 | | | | | --25.00%-- 0x7f4ad59f8de0 | | | |--0.99%-- unmap_underlying_metadata | | do_blockdev_direct_IO | | __blockdev_direct_IO | | xfs_vm_direct_IO | | xfs_file_dio_aio_write | | xfs_file_write_iter | | aio_run_iocb | | do_io_submit | | sys_io_submit | | entry_SYSCALL_64_fastpath | | io_submit | | 0x46d98a | | | |--0.99%-- __kmalloc_node_track_caller | | __kmalloc_reserve.isra.32 | | __alloc_skb | | sk_stream_alloc_skb | | tcp_sendmsg | | 
inet_sendmsg | | sock_sendmsg | | SYSC_sendto | | sys_sendto | | entry_SYSCALL_64_fastpath | | __libc_send | | _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv | | 0x7f4ad6bf8de0 | | | --0.99%-- task_work_run | do_notify_resume | int_signal | __libc_close | std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access | --0.98%-- __cond_resched_softirq release_sock tcp_sendmsg inet_sendmsg sock_sendmsg SYSC_sendto sys_sendto entry_SYSCALL_64_fastpath __libc_send _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv 0x7f4ada1f8de0 # # (For a higher level overview, try: perf report --sort comm,dso) # [-- Attachment #3: example-stale.txt --] [-- Type: text/plain, Size: 1700 bytes --] [164814.835933] CPU: 22 PID: 48042 Comm: scylla Tainted: G E 4.2.6-200.fc22.x86_64 #1 [164814.835936] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/06/2015 [164814.835937] 0000000000000000 00000000a8713b7a ffff8802fb977ab8 ffffffff817729ea [164814.835941] 0000000000000000 ffff88076a69f780 ffff8802fb977ad8 ffffffffa03217a6 [164814.835946] ffff88077119bcb0 0000000000000000 ffff8802fb977b08 ffffffffa034e749 [164814.835951] Call Trace: [164814.835954] [<ffffffff817729ea>] dump_stack+0x45/0x57 [164814.835971] [<ffffffffa03217a6>] xfs_buf_stale+0x26/0x80 [xfs] [164814.835989] [<ffffffffa034e749>] xfs_trans_binval+0x79/0x100 [xfs] [164814.836001] [<ffffffffa02f479b>] xfs_bmap_btree_to_extents+0x12b/0x1a0 [xfs] [164814.836012] [<ffffffffa02f8977>] xfs_bunmapi+0x967/0x9f0 [xfs] [164814.836027] [<ffffffffa0334b9e>] xfs_itruncate_extents+0x10e/0x220 [xfs] [164814.836044] [<ffffffffa033f75a>] ? 
kmem_zone_alloc+0x5a/0xe0 [xfs] [164814.836084] [<ffffffffa0334d49>] xfs_inactive_truncate+0x99/0x110 [xfs] [164814.836120] [<ffffffffa0335aa2>] xfs_inactive+0x102/0x120 [xfs] [164814.836135] [<ffffffffa033a6cf>] xfs_fs_evict_inode+0x6f/0xa0 [xfs] [164814.836138] [<ffffffff81238d76>] evict+0xa6/0x170 [164814.836140] [<ffffffff81239026>] iput+0x196/0x220 [164814.836147] [<ffffffff81234fe4>] __dentry_kill+0x174/0x1c0 [164814.836150] [<ffffffff8123514b>] dput+0x11b/0x200 [164814.836155] [<ffffffff8121fe02>] __fput+0x172/0x1e0 [164814.836158] [<ffffffff8121febe>] ____fput+0xe/0x10 [164814.836161] [<ffffffff810bab75>] task_work_run+0x85/0xb0 [164814.836164] [<ffffffff81014a4d>] do_notify_resume+0x8d/0x90 [164814.836167] [<ffffffff817795bc>] int_signal+0x12/0x17 [-- Attachment #4: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
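[Editorial illustration, not part of the original thread.] Question 1 above asks whether extents could be kept "from the very beginning". From userspace, the closest approximation is preallocating files with fallocate(2) before issuing direct I/O, so the io_submit path is less likely to enter block allocation (the xfs_bmapi_write / xfs_iomap_write_direct chains visible in the traces). A minimal sketch, with the caveat that posix_fallocate leaves the extents unwritten, so XFS still performs an unwritten-to-written conversion on first write; this reduces, but does not eliminate, metadata work in the write path:

```python
import os
import tempfile

def preallocate(path: str, size: int) -> None:
    # Reserve blocks up front so later direct-I/O writes are less likely
    # to need extent allocation at io_submit time. posix_fallocate
    # extends the file to `size` and allocates extents, but leaves them
    # unwritten: XFS converts unwritten -> written on the first write
    # to each range, which is the conversion discussed in the thread.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)

path = os.path.join(tempfile.gettempdir(), "prealloc.dat")
preallocate(path, 1 << 20)
print(os.stat(path).st_size)  # 1048576
```

Whether this helps in practice depends on the workload: it moves allocation out of the submission path but not the unwritten-extent conversion, which is exactly the transaction work Glauber observes.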
* Re: sleeps and waits during io_submit 2015-11-28 2:43 sleeps and waits during io_submit Glauber Costa @ 2015-11-30 14:10 ` Brian Foster 2015-11-30 14:29 ` Avi Kivity 2015-11-30 15:49 ` Glauber Costa 2015-11-30 23:10 ` Dave Chinner 1 sibling, 2 replies; 58+ messages in thread From: Brian Foster @ 2015-11-30 14:10 UTC (permalink / raw) To: Glauber Costa; +Cc: Avi Kivity, xfs On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote: > Hello my dear XFSers, > > For those of you who don't know, we at ScyllaDB produce a modern NoSQL > data store that, at the moment, runs on top of XFS only. We deal > exclusively with asynchronous and direct IO, due to our > thread-per-core architecture. Due to that, we avoid issuing any > operation that will sleep. > > While debugging an extreme case of bad performance (most likely > related to a not-so-great disk), I have found a variety of cases in > which XFS blocks. To find those, I have used perf record -e > sched:sched_switch -p <pid_of_db>, and I am attaching the perf report > as xfs-sched_switch.log. Please note that this doesn't tell me for how > long we block, but as mentioned before, blocking operations outside > our control are detrimental to us regardless of the elapsed time. > > For those who are not acquainted with our internals, please ignore > everything in that file but the xfs functions. For the xfs symbols, > there are two kinds of events: the ones that are children of > io_submit, where we don't tolerate blocking, and the ones that are > children of our helper IO thread, to where we push big operations that > we know will block until we can get rid of them all. We care about the > former and ignore the latter. > > Please allow me to ask you a couple of questions about those findings. > If we are doing anything wrong, advice on best practices is truly > welcome. > > 1) xfs_buf_lock -> xfs_log_force. > > I've started wondering what would make xfs_log_force sleep. 
But then I > have noticed that xfs_log_force will only be called when a buffer is > marked stale. Most of the time a buffer is marked stale seems to be > due to errors. Although that is not my case (more on that), it got me > thinking that maybe the right thing to do would be to avoid hitting > this case altogether? > I'm not following where you get the "only if marked stale" part..? It certainly looks like that's one potential purpose for the call, but this is called in a variety of other places as well. E.g., forcing the log via pushing on the ail when it has pinned items is another case. The ail push itself can originate from transaction reservation, etc., when log space is needed. In other words, I'm not sure this is something that's easily controlled from userspace, if at all. Rather, it's a significant part of the wider state machine the fs uses to manage logging. > The file example-stale.txt contains a backtrace of the case where we > are being marked as stale. It seems to be happening when we convert > the inode's extents from unwritten to real. Can this case be > avoided? I won't pretend I know the intricacies of this, but couldn't > we be keeping extents from the very beginning to avoid creating stale > buffers? > This is down in xfs_fs_evict_inode()->xfs_inactive(), which is generally when an inode is evicted from cache. In this case, it looks like the inode is unlinked (permanently removed), the extents are being removed and a bmap btree block is being invalidated as part of that overall process. I don't think this has anything to do with unwritten extents. > 2) xfs_buf_lock -> down > This is one I truly don't understand. What can be causing contention > in this lock? We never have two different cores writing to the same > buffer, nor should we have the same core doing so. > This is not one single lock. 
An XFS buffer is the data structure used to modify/log/read-write metadata on-disk and each buffer has its own lock to prevent corruption. Buffer lock contention is possible because the filesystem has bits of "global" metadata that have to be updated via buffers. For example, usually one has multiple allocation groups to maximize parallelism, but we still have per-ag metadata that has to be tracked globally with respect to each AG (e.g., free space trees, inode allocation trees, etc.). Any operation that affects this metadata (e.g., block/inode allocation) has to lock the agi/agf buffers along with any buffers associated with the modified btree leaf/node blocks, etc. One example in your attached perf traces has several threads looking to acquire the AGF, which is a per-AG data structure for tracking free space in the AG. One thread looks like the inode eviction case noted above (freeing blocks), another looks like a file truncate (also freeing blocks), and yet another is a block allocation due to a direct I/O write. Were any of these operations directed to an inode in a separate AG, they would be able to proceed in parallel (but I believe they would still hit the same codepaths as far as perf can tell). > 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time > > You guys seem to have an interface to avoid that, by setting the > FMODE_NOCMTIME flag. This is done by issuing the open by handle ioctl, > which will set this flag for all regular files. That's great, but that > ioctl requires CAP_SYS_ADMIN, which is a big no for us, since we run > our server as an unprivileged user. I don't understand, however, why > such a strict check is needed. If we have full rights on the > filesystem, why can't we issue this operation? In my view, CAP_FOWNER > should already be enough. I do understand the handles have to be stable > and a file can have its ownership changed, in which case the previous > owner would keep the handle valid. 
Is that the reason you went with > the most restrictive capability ? I'm not familiar enough with the open-by-handle stuff to comment on the permission constraints. Perhaps Dave or others can comment further on this bit... Brian > # To display the perf.data header info, please use --header/--header-only options. > # > # > # Total Lost Samples: 0 > # > # Samples: 2K of event 'sched:sched_switch' > # Event count (approx.): 2669 > # > # Overhead Command Shared Object Symbol > # ........ ....... ................. .............. > # > 100.00% scylla [kernel.kallsyms] [k] __schedule > | > ---__schedule > | > |--96.18%-- schedule > | | > | |--56.14%-- schedule_user > | | | > | | |--53.30%-- int_careful > | | | | > | | | |--45.05%-- 0x7f4ade6f74ed > | | | | reactor_backend_epoll::make_reactor_notifier > | | | | | > | | | | |--67.63%-- syscall_work_queue::submit_item > | | | | | | > | | | | | |--32.05%-- posix_file_impl::truncate > | | | | | | | > | | | | | | |--65.33%-- _ZN12continuationIZN6futureIJEE4thenIZN19file_data_sink_impl5flushEvEUlvE_S1_EET0_OT_EUlS7_E_JEE3runEv > | | | | | | | reactor::del_timer > | | | | | | | 0x60b0000e2040 > | | | | | | | > | | | | | | |--20.00%-- db::commitlog::segment::flush(unsigned long)::{lambda()#1}::operator() > | | | | | | | | > | | | | | | | |--73.33%-- future<>::then<db::commitlog::segment::flush(unsigned long)::{lambda()#1}, future<lw_shared_ptr<db::commitlog::segment> > > > | | | | | | | | _ZN12continuationIZN6futureIJ13lw_shared_ptrIN2db9commitlog7segmentEEEE4thenIZNS4_4syncEvEUlT_E_S6_EET0_OS8_EUlSB_E_JS5_EE3runEv > | | | | | | | | reactor::del_timer > | | | | | | | | 0x60e0000e2040 > | | | | | | | | > | | | | | | | --26.67%-- _ZN12continuationIZN6futureIJEE4thenIZN2db9commitlog7segment5flushEmEUlvE_S0_IJ13lw_shared_ptrIS5_EEEEET0_OT_EUlSC_E_JEE3runEv > | | | | | | | reactor::del_timer > | | | | | | | 0x6090000e2040 > | | | | | | | > | | | | | | |--10.67%-- sstables::sstable::seal_sstable > | | | | | | | 
std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambd a()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke > | | | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev > | | | | | | | > | | | | | | --4.00%-- sstables::sstable::write_toc > | | | | | | sstables::sstable::prepare_write_components > | | | | | | | > | | | | | | |--50.00%-- 0x4d3a4f6ec4e8cd75 > | | | | | | | > | | | | | | --50.00%-- 0x3ebf3dd80e3b174d > | | | | | | > | | | | | |--23.93%-- posix_file_impl::discard > | | | | | | | > | | | | | | |--82.14%-- _ZN12continuationIZN6futureIImEE4thenIZN19file_data_sink_impl6do_putEm16temporary_bufferIcEEUlmE_S0_IIEEEET0_OT_EUlSA_E_ImEE3runEv > | | | | | | | reactor::del_timer > | | | | | | | 0x6080000e2040 > | | | | | | | > | | | | | | --17.86%-- futurize<future<lw_shared_ptr<db::commitlog::segment> > >::apply<db::commitlog::segment_manager::allocate_segment(bool)::{lambda(file)#1}, file> > | | | | | | _ZN12continuationIZN6futureIJ4fileEE4thenIZN2db9commitlog15segment_manager16allocate_segmentEbEUlS1_E_S0_IJ13lw_shared_ptrINS5_7segmentEEEEEET0_OT_EUlSE_E_JS1_EE3runEv > | | | | | | > | | | | | 
|--20.94%-- reactor::open_file_dma > | | | | | | | > | | | | | | |--20.41%-- db::commitlog::segment_manager::allocate_segment > | | | | | | | db::commitlog::segment_manager::on_timer()::{lambda()#1}::operator() > | | | | | | | 0xb8c264 > | | | | | | | > | | | | | | |--14.29%-- sstables::sstable::write_simple<(sstables::sstable::component_type)8, sstables::statistics> > | | | | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambd a()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke > | | | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev > | | | | | | | > | | | | | | |--12.24%-- sstables::write_crc > | | | | | | | | > | | | | | | | |--16.67%-- 0x313532343536002f > | | | | | | | | > | | | | | | | |--16.67%-- 0x373633323533002f > | | | | | | | | > | | | | | | | |--16.67%-- 0x363139333232002f > | | | | | | | | > | | | | | | | |--16.67%-- 0x353933303330002f > | | | | | | | | > | | | | | | | |--16.67%-- 0x383930383133002f > | | | | | | | | > | | | | | | | --16.67%-- 0x323338303037002f > | | | | | | | > | | | | | | |--12.24%-- sstables::write_digest > | | | | | | | > | | | | | | |--10.20%-- 
sstables::sstable::write_simple<(sstables::sstable::component_type)7, sstables::filter> > | | | | | | | sstables::sstable::write_filter > | | | | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambd a()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke > | | | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev > | | | | | | | > | | | | | | |--10.20%-- sstables::sstable::write_simple<(sstables::sstable::component_type)4, sstables::summary_ka> > | | | | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> 
seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambd a()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke > | | | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev > | | | | | | | > | | | | | | |--10.20%-- 0x78d93b > | | | | | | | > | | | | | | |--6.12%-- sstables::sstable::open_data > | | | | | | | | > | | | | | | | --100.00%-- 0x8000000004000000 > | | | | | | | > | | | | | | --4.08%-- sstables::sstable::write_toc > | | | | | | sstables::sstable::prepare_write_components > | | | | | | | > | | | | | | --100.00%-- 0x6100206690ef > | | | | | | > | | | | | |--18.38%-- syscall_work_queue::submit_item > | | | | | | | > | | | | | | |--10.00%-- 0x7f4ad89f8fe0 > | | | | | | | > | | | | | | |--7.50%-- 0x7f4ad83f8fe0 > | | | | | | | > | | | | | | |--7.50%-- 0x7f4ad6bf8fe0 > | | | | | | | > | | | | | | |--7.50%-- 0x7f4ad65f8fe0 > | | | | | | | > | | | | | | |--5.00%-- 0x60b015e8cd90 > | | | | | | | > | | | | | | |--5.00%-- 0x60100acaed90 > | | | | | | | > | | | | | | |--5.00%-- 0x607006f04d90 > | | | | | | | > | | | | | | |--5.00%-- 0xffffffffffffa5d0 > | | | | | | | > | | | | | | |--2.50%-- 0x60e01acbed90 > | | | | | | | > | | | | | | |--2.50%-- 0x60e01acbec60 > | | | | | | | > | | | | | | |--2.50%-- 0x60a018d7ad90 > | | | | | | | > | | | | | | |--2.50%-- 0x60a018d7ac60 > | | | | | | | > | | | | | | |--2.50%-- 0x60b015e8cc60 > | | | | | | | > | | | | | | |--2.50%-- 0x60900bb8ad60 > | | | | | | | > | | | | | | |--2.50%-- 0x60100acaec60 > | | | | | | | > | | | | | | |--2.50%-- 0x60800951dd90 > | | | | | | | > | | | | | | |--2.50%-- 0x60800951dc60 > | | | | | | | > | | | | | | |--2.50%-- 0x60d009089d90 > | | | | | | | > | | | | | | |--2.50%-- 0x60d009089c60 > | | | | | | | > | | | | | | |--2.50%-- 
0x607006f04c60 > | | | | | | | > | | | | | | |--2.50%-- 0x60f005984d60 > | | | | | | | > | | | | | | |--2.50%-- 0x7f4ad77f8fe0 > | | | | | | | > | | | | | | |--2.50%-- 0x7f4adb9f8fe0 > | | | | | | | > | | | | | | |--2.50%-- 0x7f4ad9bf8fe0 > | | | | | | | > | | | | | | |--2.50%-- 0x7f4ad7df8fe0 > | | | | | | | > | | | | | | |--2.50%-- 0x7f4ad77f8fe0 > | | | | | | | > | | | | | | --2.50%-- 0x7f4ad5ff8fe0 > | | | | | | > | | | | | |--2.99%-- reactor::open_directory > | | | | | | | > | | | | | | |--57.14%-- sstables::sstable::filename > | | | | | | | > | | | | | | --42.86%-- sstables::sstable::write_toc > | | | | | | sstables::sstable::prepare_write_components > | | | | | | | > | | | | | | |--50.00%-- 0x4d3a4f6ec4e8cd75 > | | | | | | | > | | | | | | --50.00%-- 0x3ebf3dd80e3b174d > | | | | | | > | | | | | --1.71%-- reactor::rename_file > | | | | | sstables::sstable::seal_sstable > | | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::ty pe ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke > | | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev > | | | | | > | | | | --32.37%-- 
_ZN12continuationIZN6futureIJEE4thenIZN18syscall_work_queue11submit_itemEPNS3_9work_itemEEUlvE_S1_EET0_OT_EUlS9_E_JEE3runEv > | | | | reactor::del_timer > | | | | 0x60d0000e2040 > | | | | > | | | |--29.04%-- __vdso_clock_gettime > | | | | > | | | |--19.66%-- 0x7f4ade42b193 > | | | | reactor_backend_epoll::complete_epoll_event > | | | | | > | | | | |--41.61%-- smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(frozen_mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(frozen_mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process > | | | | | | > | | | | | |--79.03%-- smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}> > | | | | | | | > | | | | | | |--95.92%-- 0x6070000c3000 > | | | | | | | > | | | | | | |--2.04%-- 0x61d0000c1000 > | | | | | | | > | | | | | | --2.04%-- 0x61d0000c1000 > | | | | | | > | | | | | |--3.23%-- 0x14dd51 > | | | | | | > | | | | | |--1.61%-- 0x162a54 > | | | | | | > | | | | | |--1.61%-- 0x161dca > | | | | | | > | | | | | |--1.61%-- 0x159c8b > | | | | | | > | | | | | |--1.61%-- 0x1598b5 > | | | | | | > | | | | | |--1.61%-- 0x14dd3e > | | | | | | > | | | | | |--1.61%-- 0x14bad8 > | | | | | | > | | | | | |--1.61%-- 0x14a880 > | | | | | | > | | | | | |--1.61%-- 0x127105 > | | | | | | > | | | | | |--1.61%-- 0x6070000e2040 > | | | | | | > | | | | | |--1.61%-- smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}> > | | | | | | 0x60d0000c3000 > | | | | | | > | | | | | --1.61%-- __vdso_clock_gettime > | | | | | 0x7f4ad77f9160 > | | | | | > | | | | |--30.20%-- __restore_rt > | | | | | | > | | | | | |--57.14%-- __vdso_clock_gettime > | | | | | | 0x1d > | | | | | | > | | | | | |--9.52%-- smp_message_queue::smp_message_queue > | | | | | | 0x6070000c3000 > | | | | | | > | | | | | |--4.76%-- 
0x600000357240 > | | | | | | > | | | | | |--4.76%-- 0x60000031a640 > | | | | | | > | | | | | |--2.38%-- posix_file_impl::list_directory > | | | | | | 0x609000044730 > | | | | | | > | | | | | |--2.38%-- 0x46efbf > | | | | | | > | | | | | |--2.38%-- 0x600000442e40 > | | | | | | > | | | | | |--2.38%-- 0x600000376440 > | | | | | | > | | | | | |--2.38%-- 0x6000002bac40 > | | | | | | > | | | | | |--2.38%-- 0x600000295640 > | | | | | | > | | | | | |--2.38%-- 0x600000289e40 > | | | | | | > | | | | | |--2.38%-- 0x60000031a640 > | | | | | | > | | | | | |--2.38%-- 0x7f4ade6f74ed > | | | | | | __libc_siglongjmp > | | | | | | 0x60000047be40 > | | | | | | > | | | | | --2.38%-- 0x7f4adb3f7fd0 > | | | | | > | | | | |--14.09%-- 0x33 > | | | | | > | | | | |--12.08%-- promise<temporary_buffer<char> >::promise > | | | | | _ZN6futureIJ16temporary_bufferIcEEE4thenIZN12input_streamIcE12read_exactlyEmEUlT_E_S2_EET0_OS6_ > | | | | | | > | | | | | |--44.44%-- input_stream<char>::read_exactly > | | | | | | 0x8 > | | | | | | > | | | | | |--11.11%-- 0x7f4adb3f8ea0 > | | | | | | > | | | | | |--11.11%-- 0x7f4ad9bf8ea0 > | | | | | | > | | | | | |--11.11%-- 0x7f4ad89f8ea0 > | | | | | | > | | | | | |--11.11%-- 0x7f4ad83f8ea0 > | | | | | | > | | | | | |--5.56%-- 0x7f4ad77f8ea0 > | | | | | | > | | | | | --5.56%-- 0x7f4ad7df8ea0 > | | | | | > | | | | |--1.34%-- 0x7f4ad6bf8d80 > | | | | | > | | | | --0.67%-- 0x7f4adadf8d80 > | | | | > | | | |--4.43%-- __libc_send > | | | | _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv > | | | | | > | | | | |--14.71%-- 0x4 > | | | | | > | | | | |--11.76%-- 0x7f4ad89f8de0 > | | | | | > | | | | |--8.82%-- 0x7f4adb3f8de0 > | | | | | > | | | | |--8.82%-- 0x7f4ad9bf8de0 > | | | | | > | | | | |--8.82%-- 0x7f4ad77f8de0 > | | | | | > | | | | |--8.82%-- 0x7f4ad6bf8de0 > | | | | | > | | | | |--5.88%-- 0x7f4ad83f8de0 > | | | | | > | | | | |--5.88%-- 0x7f4ad7df8de0 > | | | | | > | | | | |--5.88%-- 
0x7f4ad53f8de0 > | | | | | > | | | | |--2.94%-- 0x7f4acc9f8de0 > | | | | | > | | | | |--2.94%-- continuation<future<file>::wait()::{lambda(future_state<file>&&)#1}, file>::~continuation > | | | | | 0x611003c8e9b8 > | | | | | > | | | | |--2.94%-- 0x7f4adb9f8de0 > | | | | | > | | | | |--2.94%-- 0x7f4ad71f8de0 > | | | | | > | | | | |--2.94%-- 0x7f4ad65f8de0 > | | | | | > | | | | |--2.94%-- 0x7f4ad59f8de0 > | | | | | > | | | | --2.94%-- 0x7f4ad35f8de0 > | | | | > | | | |--1.56%-- 0x7f4ade6f754d > | | | | reactor::read_some > | | | | | > | | | | |--66.67%-- _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv > | | | | | reactor::del_timer > | | | | | 0x6070000e2040 > | | | | | > | | | | |--8.33%-- _ZN12continuationIZN6futureIIEE4thenIZ5sleepINSt6chrono3_V212system_clockEmSt5ratioILl1ELl1000000EEES1_NS4_8durationIT0_T1_EEEUlvE_S1_EESA_OT_EUlSF_E_IEE3runEv > | | | | | reactor::del_timer > | | | | | 0x6080000e2040 > | | | | | > | | | | |--8.33%-- 0x600000483640 > | | | | | > | | | | |--8.33%-- 0x600000480440 > | | | | | > | | | | --8.33%-- 0x36 > | | | --0.26%-- [...] 
> | | | > | | --46.70%-- retint_careful > | | | > | | |--6.24%-- posix_file_impl::list_directory > | | | | > | | | |--80.00%-- 0x60f0000e2020 > | | | | > | | | |--5.00%-- 0x601000044730 > | | | | > | | | |--5.00%-- 0x60e000044720 > | | | | > | | | |--2.50%-- 0x60f000135500 > | | | | > | | | |--2.50%-- 0x6190000e2098 > | | | | > | | | |--2.50%-- 0x60d0000c3000 > | | | | > | | | --2.50%-- 0x1 > | | | > | | |--3.42%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop > | | | | > | | | |--95.65%-- boost::program_options::variables_map::get > | | | | > | | | --4.35%-- 0x618000044680 > | | | > | | |--3.12%-- memory::small_pool::add_more_objects > | | | | > | | | |--10.53%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::clear_and_release > | | | | mutation_partition::clustered_row > | | | | mutation::set_clustered_cell > | | | | cql3::constants::setter::execute > | | | | cql3::statements::update_statement::add_update_for_key > | | | | _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE > | | | | cql3::statements::modification_statement::get_mutations > | | | | cql3::statements::modification_statement::execute_without_condition > | | | | cql3::query_options::query_options > | | | | | > | | | | |--50.00%-- 0x7f4ad77f80e0 > | | | | | > | | | | --50.00%-- 0x7f4ad6bf80e0 > | | | | > | | | |--10.53%-- memory::small_pool::add_more_objects > | | | | | > | | | | |--50.00%-- 0x60e00015d000 > | | | | | > | | | | --50.00%-- 0x60b00af6c758 > | | | | > | | | |--5.26%-- 0x60a018ee3867 > | | | | > | | | |--5.26%-- 0x60d00d41f680 > | | | | > | | | |--5.26%-- 0x61400c6bb4d0 > | | | | > | | | |--5.26%-- 0x60e007c918d6 > | | | | > | | | |--5.26%-- 0x60e0078294ce > | | | | > | | | |--5.26%-- 0x607006ee4da0 > | | | | > | 
| | |--5.26%-- _ZN12continuationIZN6futureIJEE12then_wrappedIZNS1_16handle_exceptionIZN7service13storage_proxy22send_to_live_endpointsEmEUlNSt15__exception_ptr13exception_ptrEE0_EES1_OT_EUlSA_E_S1_EET0_SA_EUlSA_E_JEE3runEv > | | | | reactor::del_timer > | | | | 0x6030000e2040 > | | | | > | | | |--5.26%-- service::storage_proxy::mutate_locally > | | | | service::storage_proxy::send_to_live_endpoints > | | | | parallel_for_each<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, service::storage_proxy::mutate_begin(std::vector<unsigned long, std::allocator<unsigned long> >, db::consistency_level)::{lambda(unsigned long)#1}> > | | | | 0x601000136d00 > | | | | > | | | |--5.26%-- 0x60a0001900e0 > | | | | > | | | |--5.26%-- 0x60e00015d040 > | | | | > | | | |--5.26%-- 0x61300015d000 > | | | | > | | | |--5.26%-- 0x60e00013bde0 > | | | | > | | | |--5.26%-- 0x60b00010f308 > | | | | > | | | |--5.26%-- 0x6010000e4808 > | | | | > | | | --5.26%-- 0x7f4ad65f7f50 > | | | > | | |--2.82%-- std::unique_ptr<reactor::pollfn, std::default_delete<std::unique_ptr> > reactor::make_pollfn<reactor::run()::{lambda()#3}>(reactor::run()::{lambda()#3}&&)::the_pollfn::poll_and_check_more_work > | | | | > | | | |--25.00%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop > | | | | boost::program_options::variables_map::get > | | | | > | | | |--25.00%-- 0x1 > | | | | > | | | |--12.50%-- 0x53 > | | | | > | | | |--12.50%-- 0x3e > | | | | > | | | |--12.50%-- 0x24 > | | | | > | | | --12.50%-- 0xb958000000000000 > | | | > | | |--2.67%-- std::_Function_handler<partition_presence_checker_result (partition_key const&), column_family::make_partition_presence_checker(lw_shared_ptr<std::map<long, lw_shared_ptr<sstables::sstable>, std::less<long>, std::allocator<std::pair<long const, lw_shared_ptr<sstables::sstable> > > > >)::{lambda(partition_key const&)#1}>::_M_invoke > | | | | > | | | |--66.67%-- 0x1b5c280 > | | | | > | | | 
|--27.78%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::resize > | | | | row::apply > | | | | mutation_partition_applier::accept_row_cell > | | | | mutation_partition_view::accept > | | | | > | | | --5.56%-- 0x2a4399 > | | | > | | |--2.08%-- smp_message_queue::smp_message_queue > | | | | > | | | |--60.00%-- 0x60f0000c3000 > | | | | > | | | |--10.00%-- 0x6000002d7240 > | | | | > | | | |--10.00%-- 0x19 > | | | | > | | | |--10.00%-- 0xb > | | | | > | | | --10.00%-- 0x7 > | | | > | | |--1.93%-- smp_message_queue::process_queue<4ul, smp_message_queue::process_completions()::{lambda(smp_message_queue::work_item*)#1}> > | | | > | | |--1.63%-- __vdso_clock_gettime > | | | | > | | | --100.00%-- __clock_gettime > | | | std::chrono::_V2::system_clock::now > | | | 0xa63209 > | | | > | | |--1.49%-- memory::small_pool::deallocate > | | | | > | | | |--40.00%-- managed_vector<atomic_cell_or_collection, 5u, unsigned int>::emplace_back<atomic_cell_or_collection> > | | | | > | | | |--20.00%-- std::_Hashtable<unsigned long, std::pair<unsigned long const, service::storage_proxy::rh_entry>, std::allocator<std::pair<unsigned long const, service::storage_proxy::rh_entry> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_erase > | | | | service::storage_proxy::got_response > | | | | _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv > | | | | reactor::del_timer > | | | | 0x6100000e2040 > | | | | > | | | |--10.00%-- cql3::statements::modification_statement::get_mutations > | | | | > | | | |--10.00%-- cql3::statements::modification_statement::build_partition_keys > | | | | 
cql3::statements::modification_statement::create_exploded_clustering_prefix > | | | | 0x60c014be0b00 > | | | | > | | | |--10.00%-- mutation_partition::~mutation_partition > | | | | std::vector<mutation, std::allocator<mutation> >::~vector > | | | | service::storage_proxy::mutate_with_triggers > | | | | cql3::statements::modification_statement::execute_without_condition > | | | | cql3::statements::modification_statement::execute > | | | | cql3::query_processor::process_statement > | | | | transport::cql_server::connection::process_execute > | | | | transport::cql_server::connection::process_request_one > | | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> > | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> > | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > > | | | | 0x8961de > | | | | > | | | --10.00%-- object_deleter_impl<deleter>::~object_deleter_impl > | | | 
_ZN12continuationIZN6futureIJEE12then_wrappedIZZNS1_7finallyIZ7do_withI11foreign_ptrI10shared_ptrIN9transport10cql_server8responseEEEZZNS8_10connection14write_responseEOSB_ENUlvE_clEvEUlRT_E_EDaOSF_OT0_EUlvE_EES1_SI_ENUlS1_E_clES1_EUlSF_E_S1_EESJ_SI_EUlSI_E_JEED0Ev > | | | 0x61a0000c3db0 > | | | > | | |--1.34%-- dht::decorated_key::equal > | | | | > | | | |--83.33%-- 0x607000138f00 > | | | | > | | | --16.67%-- 0x60a0000e0f40 > | | | > | | |--1.34%-- service::storage_proxy::send_to_live_endpoints > | | | > | | |--1.19%-- transport::cql_server::connection::process_execute > | | | transport::cql_server::connection::process_request_one > | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> > | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> > | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > > | | | | > | | | |--87.50%-- transport::cql_server::connection::process_request > | | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, 
transport::cql_server::connection::process()::{lambda()#1}&> > | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | | 0x60e0000c3000 > | | | | > | | | --12.50%-- 0x8961de > | | | > | | |--1.19%-- reactor::run > | | | | > | | | |--87.50%-- smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator() > | | | | continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run > | | | | 0x600000043d00 > | | | | > | | | --12.50%-- app_template::run_deprecated > | | | main > | | | __libc_start_main > | | | _GLOBAL__sub_I__ZN3org6apache9cassandra21g_cassandra_constantsE > | | | 0x7f4ae20c9fa0 > | | | > | | |--1.04%-- __clock_gettime > | | | std::chrono::_V2::system_clock::now > | | | | > | | | |--42.86%-- reactor::run > | | | | smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator() > | | | | continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > 
file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run > | | | | 0x600000043d00 > | | | | > | | | |--14.29%-- 0xa63209 > | | | | > | | | |--14.29%-- continuation<future<> future<>::finally<auto do_with<std::vector<frozen_mutation, std::allocator<frozen_mutation> >, shared_ptr<service::storage_proxy>, service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> >)#1}::operator()(std::vector<frozen_mutation, std::allocator<frozen_mutation> >) const::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> > const&, shared_ptr<service::storage_proxy>&)#1}>(std::vector<frozen_mutation, std::allocator<frozen_mutation> >&&, shared_ptr<service::storage_proxy>&&, service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> >)#1}::operator()(std::vector<frozen_mutation, std::allocator<frozen_mutation> >) const::{lambda(std::vector<frozen_mutation, std::allocator<frozen_mutation> > const&, shared_ptr<service::storage_proxy>&)#1}&&):: {lambda()#1}>(service::storage_proxy::init_messaging_service()::{lambda(std::vector<frozen_mutation, std::a > | | | | 0x2b7434 > | | | | > | | | |--14.29%-- _ZN8futurizeI6futureIJSt10unique_ptrIN4cql317update_parametersESt14default_deleteIS3_EEEEE5applyIZNS2_10statements22modification_statement22make_update_parametersERN7seastar7shardedIN7service13storage_proxyEEE13lw_shared_ptrISt6vectorI13partition_keySaISK_EEESI_I26exploded_clustering_prefixERKNS2_13query_optionsEblEUlT_E_JNSt12experimental15fundamentals_v18optionalINS3_13prefetch_dataEEEEEES7_OST_OSt5tupleIJDpT0_EE > | | | | 
cql3::statements::modification_statement::make_update_parameters > | | | | cql3::statements::modification_statement::get_mutations > | | | | cql3::statements::modification_statement::execute_without_condition > | | | | cql3::query_options::query_options > | | | | 0x7f4ad6bf80e0 > | | | | > | | | --14.29%-- database::apply_in_memory > | | | database::do_apply > | | | _ZN12continuationIZN6futureIJEE4thenIZN8database5applyERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSA_E_JEE3runEv > | | | reactor::del_timer > | | | 0x6090000e2040 > | | | > | | |--1.04%-- memory::small_pool::allocate > | | | | > | | | |--14.29%-- 0x5257c379469d9 > | | | | > | | | |--14.29%-- 0x609002b9fe98 > | | | | > | | | |--14.29%-- 0x13c8b90 > | | | | > | | | |--14.29%-- 0x60f000190710 > | | | | > | | | |--14.29%-- 0x25 > | | | | > | | | |--14.29%-- 0x7f4ad6bf84c0 > | | | | > | | | --14.29%-- 0x7f4ad53f81f0 > | | | > | | |--0.89%-- db::serializer<atomic_cell_view>::serializer > | | | mutation_partition_serializer::write_without_framing > | | | frozen_mutation::frozen_mutation > | | | frozen_mutation::frozen_mutation > | | | > | | |--0.89%-- do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | 0x60f0000c3000 > | | | > | | |--0.89%-- 
futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> > | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > > | | | transport::cql_server::connection::process_request > | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | | > | | | |--83.33%-- 0x6090000c3000 > | | | | > | | | --16.67%-- 0x600000044400 > | | | > | | |--0.89%-- std::_Function_handler<void (), reactor::run()::{lambda()#8}>::_M_invoke > | | | | > | | | |--50.00%-- reactor::run > | | | | smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator() > | | | | continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned 
long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run > | | | | 0x600000043d00 > | | | | > | | | --50.00%-- reactor::signals::signal_handler::signal_handler > | | | 0x3e8 > | | | > | | |--0.74%-- db::commitlog::segment::allocate > | | | | > | | | --100.00%-- db::commitlog::add > | | | database::do_apply > | | | | > | | | |--75.00%-- database::apply > | | | | smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process > | | | | smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}> > | | | | boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop > | | | | boost::program_options::variables_map::get > | | | | > | | | --25.00%-- _ZN12continuationIZN6futureIJEE4thenIZN8database5applyERK15frozen_mutationEUlvE_S1_EET0_OT_EUlSA_E_JEE3runEv > | | | reactor::del_timer > | | | 0x60b0000e2040 > | | | > | | |--0.74%-- service::storage_proxy::create_write_response_handler > | | | > | | |--0.74%-- transport::cql_server::connection::process_request_one > | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) 
const::{lambda()#1}::operator()()::{lambda()#1}&> > | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> > | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > > | | | | > | | | |--80.00%-- transport::cql_server::connection::process_request > | | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> > | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | | 0x60a0000c3000 > | | | | > | | | --20.00%-- 0x8961de > | | | > | | |--0.74%-- compound_type<(allow_prefixes)0>::compare > | | | | > | | | |--20.00%-- 0x6030056c0f20 > | | | | > | | | |--20.00%-- boost::intrusive::bstbase2<boost::intrusive::mhtraits<rows_entry, boost::intrusive::set_member_hook<void, void, void, void>, &rows_entry::_link>, rows_entry::compare, 
(boost::intrusive::algo_types)5, boost::intrusive::detail::default_header_holder<boost::intrusive::rbtree_node_traits<void*, false> > >::find > | | | | mutation_partition::clustered_row > | | | | mutation::set_clustered_cell > | | | | cql3::constants::setter::execute > | | | | cql3::statements::update_statement::add_update_for_key > | | | | _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE > | | | | cql3::statements::modification_statement::get_mutations > | | | | cql3::statements::modification_statement::execute_without_condition > | | | | cql3::query_options::query_options > | | | | 0x7f4adb3f80e0 > | | | | > | | | |--20.00%-- compound_type<(allow_prefixes)0>::compare > | | | | > | | | |--20.00%-- mutation_partition::clustered_row > | | | | boost::intrusive::bstree_impl<boost::intrusive::mhtraits<rows_entry, boost::intrusive::set_member_hook<void, void, void, void>, &rows_entry::_link>, rows_entry::compare, unsigned long, true, (boost::intrusive::algo_types)5, boost::intrusive::detail::default_header_holder<boost::intrusive::rbtree_node_traits<void*, false> > >::insert_unique > | | | | boost::intrusive::bstree_algorithms<boost::intrusive::rbtree_node_traits<void*, false> >::prev_node > | | | | 0x12d > | | | | > | | | --20.00%-- 0x60f00052daf0 > | | | > | | |--0.74%-- __memmove_ssse3_back > | | | | > | | | |--40.00%-- output_stream<char>::write > | | | | | > | | | | |--50.00%-- transport::cql_server::response::output > | | | | | futurize<future<> >::apply<transport::cql_server::connection::write_response(foreign_ptr<shared_ptr<transport::cql_server::response> >&&)::{lambda()#1}> > | | | | | > | | | | --50.00%-- 0x7c7fb2 > | | | | 0x5257c37847fa0 > | | | | > | | | |--20.00%-- transport::cql_server::connection::read_short_bytes > | 
| | | transport::cql_server::connection::process_query > | | | | 0x7f4ada7f86f0 > | | | | > | | | |--20.00%-- transport::cql_server::response::output > | | | | futurize<future<> >::apply<transport::cql_server::connection::write_response(foreign_ptr<shared_ptr<transport::cql_server::response> >&&)::{lambda()#1}> > | | | | 0x2 > | | | | > | | | --20.00%-- smp_message_queue::flush_response_batch > | | | boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop > | | | boost::program_options::variables_map::get > | | | > | | |--0.74%-- syscall_work_queue::work_item_returning<syscall_result_extra<stat>, reactor::file_size(basic_sstring<char, unsigned int, 15u>)::{lambda()#1}>::~work_item_returning > | | | | > | | | |--60.00%-- 0x6130000c3000 > | | | | > | | | |--20.00%-- 0x608001fe59a0 > | | | | > | | | --20.00%-- 0x16 > | | | > | | |--0.74%-- __memset_sse2 > | | | | > | | | |--40.00%-- std::_Hashtable<range<dht::token>, std::pair<range<dht::token> const, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > >, std::allocator<std::pair<range<dht::token> const, std::unordered_set<gms::inet_address, std::hash<gms::inet_address>, std::equal_to<gms::inet_address>, std::allocator<gms::inet_address> > > >, std::__detail::_Select1st, std::equal_to<range<dht::token> >, std::hash<range<dht::token> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable > | | | | locator::token_metadata::pending_endpoints_for > | | | | service::storage_proxy::create_write_response_handler > | | | | service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, 
db::write_type)#1}> > | | | | service::storage_proxy::mutate > | | | | service::storage_proxy::mutate_with_triggers > | | | | cql3::statements::modification_statement::execute_without_condition > | | | | cql3::statements::modification_statement::execute > | | | | cql3::query_processor::process_statement > | | | | transport::cql_server::connection::process_execute > | | | | transport::cql_server::connection::process_request_one > | | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> > | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> > | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > > | | | | transport::cql_server::connection::process_request > | | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> > | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, 
transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | | 0x6020000c3000 > | | | | > | | | |--40.00%-- service::digest_read_resolver::~digest_read_resolver > | | | | | > | | | | --100.00%-- 0x610002612b50 > | | | | > | | | --20.00%-- std::_Hashtable<basic_sstring<char, unsigned int, 15u>, std::pair<basic_sstring<char, unsigned int, 15u> const, std::vector<gms::inet_address, std::allocator<gms::inet_address> > >, std::allocator<std::pair<basic_sstring<char, unsigned int, 15u> const, std::vector<gms::inet_address, std::allocator<gms::inet_address> > > >, std::__detail::_Select1st, std::equal_to<basic_sstring<char, unsigned int, 15u> >, std::hash<basic_sstring<char, unsigned int, 15u> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable > | | | service::storage_proxy::send_to_live_endpoints > | | | parallel_for_each<__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, service::storage_proxy::mutate_begin(std::vector<unsigned long, std::allocator<unsigned long> >, db::consistency_level)::{lambda(unsigned long)#1}> > | | | service::storage_proxy::mutate > | | | service::storage_proxy::mutate_with_triggers > | | | cql3::statements::modification_statement::execute_without_condition > | | | cql3::statements::modification_statement::execute > | | | 
cql3::query_processor::process_statement > | | | transport::cql_server::connection::process_execute > | | | transport::cql_server::connection::process_request_one > | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> > | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> > | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > > | | | transport::cql_server::connection::process_request > | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, 
transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | 0x6070000c3000 > | | | > | | |--0.74%-- reactor::del_timer > | | | | > | | | |--80.00%-- 0x60a0000e2040 > | | | | > | | | --20.00%-- 0x6080000c3db0 > | | | > | | |--0.59%-- unimplemented::operator<< > | | | | > | | | |--25.00%-- _ZN12continuationIZN6futureIJ10shared_ptrIN9transport8messages14result_messageEEEE4thenIZN4cql315query_processor17process_statementES1_INS8_13cql_statementEERN7service11query_stateERKNS8_13query_optionsEEUlT_E_S6_EET0_OSI_EUlSL_E_JS5_EED2Ev > | | | | 0x600100000008 > | | | | > | | | |--25.00%-- floating_type_impl<float>::from_string > | | | | > | | | |--25.00%-- 0x60e0000e4c10 > | | | | > | | | --25.00%-- _ZN12continuationIZN6futureIJ10shared_ptrIN9transport8messages14result_messageEEEE4thenIZN4cql315query_processor17process_statementES1_INS8_13cql_statementEERN7service11query_stateERKNS8_13query_optionsEEUlT_E_S6_EET0_OSI_EUlSL_E_JS5_EED2Ev > | | | 0x600100000008 > | | | > | | |--0.59%-- std::_Hashtable<unsigned long, std::pair<unsigned long const, service::storage_proxy::rh_entry>, std::allocator<std::pair<unsigned long const, service::storage_proxy::rh_entry> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node > | | | service::storage_proxy::register_response_handler > | | | service::storage_proxy::create_write_response_handler > | | | service::storage_proxy::create_write_response_handler > | | | service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, 
db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}> > | | | service::storage_proxy::mutate > | | | service::storage_proxy::mutate_with_triggers > | | | cql3::statements::modification_statement::execute_without_condition > | | | cql3::statements::modification_statement::execute > | | | cql3::query_processor::process_statement > | | | transport::cql_server::connection::process_execute > | | | transport::cql_server::connection::process_request_one > | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> > | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> > | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > > | | | transport::cql_server::connection::process_request > | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, 
transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | 0x60b0000c3000 > | | | > | | |--0.59%-- mutation::set_clustered_cell > | | | | > | | | |--75.00%-- 0xa > | | | | > | | | --25.00%-- cql3::constants::setter::execute > | | | cql3::statements::update_statement::add_update_for_key > | | | _ZN8futurizeI6futureIJSt6vectorI8mutationSaIS2_EEEEE5applyIZN4cql310statements22modification_statement13get_mutationsERN7seastar7shardedIN7service13storage_proxyEEERKNS8_13query_optionsEblEUlT_E_JSt10unique_ptrINS8_17update_parametersESt14default_deleteISN_EEEEES5_OSK_OSt5tupleIJDpT0_EE > | | | cql3::statements::modification_statement::get_mutations > | | | cql3::statements::modification_statement::execute_without_condition > | | | cql3::query_options::query_options > | | | 0x7f4ad89f80e0 > | | | > | | |--0.59%-- memory::small_pool::small_pool > | | | | > | | | |--25.00%-- memory::stats > | | | | boost::program_options::variables_map::get > | | | | > | | | |--25.00%-- memory::reclaimer::~reclaimer > | | | | 0x1e > | | | | > | | | |--25.00%-- memory::allocate_aligned > | | | | > | | | --25.00%-- memory::small_pool::add_more_objects > | | | memory::small_pool::add_more_objects > | | | 0x6100000e0310 > | | | > | | |--0.59%-- __memcpy_sse2_unaligned > | | | | > | | | |--50.00%-- mutation_partition_applier::accept_row_cell > | | | | mutation_partition_view::accept > | | | | boost::intrusive::bstree_algorithms<boost::intrusive::rbtree_node_traits<void*, false> 
>::prev_node > | | | | 0x12d > | | | | > | | | |--25.00%-- scanning_reader::operator() > | | | | sstables::sstable::do_write_components > | | | | sstables::sstable::prepare_write_components > | | | | std::_Function_handler<void (), futurize<std::result_of<std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type ()>::type>::type seastar::async<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>(std::decay&&, (std::decay<sstables::sstable::write_components(mutation_reader, unsigned long, lw_shared_ptr<schema const>, unsigned long)::{lambda()#1}>::type&&)...)::{lambda(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type>::type, std::decay<{lambda()#1}>::type&&)::work&)#1}::operator()(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::type> seastar::async<{lambda()#1}>(futurize<std::result_of<std::decay<{lambda()#1}>::type ()>::typ e>::type, std::decay<{lambda()#1}>::type&&)::work)::{lambda()#1}>::_M_invoke > | | | | _GLOBAL__sub_I__ZN12app_templateC2Ev > | | | | > | | | --25.00%-- memtable::find_or_create_partition_slow > | | | memtable::apply > | | | database::apply_in_memory > | | | database::do_apply > | | | database::apply > | | | smp_message_queue::async_work_item<future<> seastar::sharded<database>::invoke_on<service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}, future<> >(unsigned int, service::storage_proxy::mutate_locally(mutation const&)::{lambda(database&)#1}&&)::{lambda()#1}>::process > | | | smp_message_queue::process_queue<2ul, smp_message_queue::process_incoming()::{lambda(smp_message_queue::work_item*)#1}> > | | | boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop > | | | boost::program_options::variables_map::get > | | | > | | 
|--0.59%-- smp_message_queue::flush_response_batch > | | | | > | | | |--25.00%-- boost::lockfree::detail::ringbuffer_base<smp_message_queue::work_item*>::pop > | | | | boost::program_options::variables_map::get > | | | | > | | | |--25.00%-- 0x13 > | | | | > | | | |--25.00%-- 0x7f4ad5ff8f40 > | | | | > | | | --25.00%-- reactor::run > | | | smp::configure(boost::program_options::variables_map)::{lambda()#1}::operator() > | | | continuation<future<temporary_buffer<char> > future<>::then<future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}, future<temporary_buffer<char> > >(future<temporary_buffer<char> > file::dma_read_bulk<char>(unsigned long, unsigned long)::{lambda(unsigned long)#1}::operator()(unsigned long)::{lambda()#3}&&)::{lambda(future<temporary_buffer<char> >)#1}>::run > | | | 0x600000043d00 > | | --54.38%-- [...] > | | > | |--14.26%-- schedule_timeout > | | | > | | |--38.52%-- wait_for_completion > | | | | > | | | |--90.07%-- flush_work > | | | | xlog_cil_force_lsn > | | | | | > | | | | |--96.85%-- _xfs_log_force_lsn > | | | | | | > | | | | | |--79.67%-- xfs_file_fsync > | | | | | | vfs_fsync_range > | | | | | | do_fsync > | | | | | | sys_fdatasync > | | | | | | entry_SYSCALL_64_fastpath > | | | | | | | > | | | | | | --100.00%-- 0x7f4ade4212ad > | | | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | | | | 0x6030000c3ec0 > | | | | | | > | | | | | --20.33%-- xfs_dir_fsync > | | | | | vfs_fsync_range > | | | | | do_fsync > | | | | | sys_fdatasync > | | | | | entry_SYSCALL_64_fastpath > | | | | | | > | | | | | --100.00%-- 0x7f4ade4212ad > | | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | | | 0x6040000c3ec0 > | | | | | > | | | | --3.15%-- _xfs_log_force > | | | | xfs_log_force > | | | | xfs_buf_lock > | | | | 
_xfs_buf_find > | | | | xfs_buf_get_map > | | | | xfs_trans_get_buf_map > | | | | xfs_btree_get_bufl > | | | | xfs_bmap_extents_to_btree > | | | | xfs_bmap_add_extent_hole_real > | | | | xfs_bmapi_write > | | | | xfs_iomap_write_direct > | | | | __xfs_get_blocks > | | | | xfs_get_blocks_direct > | | | | do_blockdev_direct_IO > | | | | __blockdev_direct_IO > | | | | xfs_vm_direct_IO > | | | | xfs_file_dio_aio_write > | | | | xfs_file_write_iter > | | | | aio_run_iocb > | | | | do_io_submit > | | | | sys_io_submit > | | | | entry_SYSCALL_64_fastpath > | | | | io_submit > | | | | 0x46d98a > | | | | > | | | --9.93%-- submit_bio_wait > | | | blkdev_issue_flush > | | | xfs_blkdev_issue_flush > | | | xfs_file_fsync > | | | vfs_fsync_range > | | | do_fsync > | | | sys_fdatasync > | | | entry_SYSCALL_64_fastpath > | | | | > | | | --100.00%-- 0x7f4ade4212ad > | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | 0x6030000c3ec0 > | | | > | | |--32.79%-- io_schedule_timeout > | | | bit_wait_io > | | | __wait_on_bit > | | | | > | | | |--51.67%-- wait_on_page_bit > | | | | | > | | | | |--95.16%-- filemap_fdatawait_range > | | | | | filemap_write_and_wait_range > | | | | | xfs_file_fsync > | | | | | vfs_fsync_range > | | | | | do_fsync > | | | | | sys_fdatasync > | | | | | entry_SYSCALL_64_fastpath > | | | | | 0x7f4ade4212ad > | | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | | | 0x60b0000c3ec0 > | | | | | > | | | | --4.84%-- __migration_entry_wait > | | | | migration_entry_wait > | | | | handle_mm_fault > | | | | __do_page_fault > | | | | do_page_fault > | | | | page_fault > | | | | std::_Function_handler<void (), httpd::http_server::_date_format_timer::{lambda()#1}>::_M_invoke > | | | | | > | | | | --100.00%-- service::storage_proxy::mutate_prepare<std::vector<mutation, std::allocator<mutation> >, 
service::storage_proxy::mutate_prepare(std::vector<mutation, std::allocator<mutation> >&, db::consistency_level, db::write_type)::{lambda(mutation const&, db::consistency_level, db::write_type)#1}> > | | | | service::storage_proxy::mutate > | | | | service::storage_proxy::mutate_with_triggers > | | | | cql3::statements::modification_statement::execute_without_condition > | | | | cql3::statements::modification_statement::execute > | | | | cql3::query_processor::process_statement > | | | | transport::cql_server::connection::process_execute > | | | | transport::cql_server::connection::process_request_one > | | | | futurize<future<std::pair<foreign_ptr<shared_ptr<transport::cql_server::response> >, service::client_state> > >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}::operator()(temporary_buffer) const::{lambda()#1}::operator()()::{lambda()#1}&> > | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}::operator()(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&) const::{lambda(temporary_buffer<char>)#1}, temporary_buffer> > | | | | futurize<future<> >::apply<transport::cql_server::connection::process_request()::{lambda(future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> >&&)#1}, future<std::experimental::fundamentals_v1::optional<transport::cql_binary_frame_v3> > > > | | | | transport::cql_server::connection::process_request > | | | | do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&> > | | | | do_void_futurize_apply<void 
do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | | do_void_futurize_apply<void do_until_continued<transport::cql_server::connection::process()::{lambda()#2}, transport::cql_server::connection::process()::{lambda()#1}&>(transport::cql_server::connection::process()::{lambda()#1}&, transport::cql_server::connection::process()::{lambda()#2}&&, promise<>)::{lambda(future<>)#1}, promise<> > > | | | | 0x6140000c3000 > | | | | > | | | --48.33%-- out_of_line_wait_on_bit > | | | block_truncate_page > | | | xfs_setattr_size > | | | xfs_vn_setattr > | | | notify_change > | | | do_truncate > | | | do_sys_ftruncate.constprop.15 > | | | sys_ftruncate > | | | entry_SYSCALL_64_fastpath > | | | __GI___ftruncate64 > | | | syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process > | | | | > | | | |--13.79%-- 0x7f4ad29ff700 > | | | | > | | | |--13.79%-- 0x7f4acdbff700 > | | | | > | | | |--12.07%-- 0x7f4ad05ff700 > | | | | > | | | |--12.07%-- 0x7f4acedff700 > | | | | > | | | |--10.34%-- 0x7f4ad0bff700 > | | | | > | | | |--6.90%-- 0x7f4ad2fff700 > | | | | > | | | |--6.90%-- 0x7f4ad11ff700 > | | | | > | | | |--6.90%-- 0x7f4acf9ff700 > | | | | > | | | |--6.90%-- 0x7f4acf3ff700 > | | | | > | | | |--6.90%-- 0x7f4ace7ff700 > | | | | > | | | |--1.72%-- 0x7f4ad17ff700 > | | | | > | | | --1.72%-- 0x7f4aca5ff700 > | | | > | | --28.69%-- __down > | | down > | | xfs_buf_lock > | | _xfs_buf_find > | | xfs_buf_get_map > | | | > | | |--97.14%-- xfs_buf_read_map > | | | xfs_trans_read_buf_map > | | | | > | | | |--98.04%-- xfs_read_agf > | | | | xfs_alloc_read_agf > | | | | xfs_alloc_fix_freelist > | | | | | > | | | | |--93.00%-- xfs_free_extent > | | | | | xfs_bmap_finish > | | | | 
| xfs_itruncate_extents > | | | | | | > | | | | | |--87.10%-- xfs_inactive_truncate > | | | | | | xfs_inactive > | | | | | | xfs_fs_evict_inode > | | | | | | evict > | | | | | | iput > | | | | | | __dentry_kill > | | | | | | dput > | | | | | | __fput > | | | | | | ____fput > | | | | | | task_work_run > | | | | | | do_notify_resume > | | | | | | int_signal > | | | | | | __libc_close > | | | | | | std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access > | | | | | | > | | | | | --12.90%-- xfs_setattr_size > | | | | | xfs_vn_setattr > | | | | | notify_change > | | | | | do_truncate > | | | | | do_sys_ftruncate.constprop.15 > | | | | | sys_ftruncate > | | | | | entry_SYSCALL_64_fastpath > | | | | | | > | | | | | --100.00%-- __GI___ftruncate64 > | | | | | syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process > | | | | | | > | | | | | |--20.00%-- 0x7f4ad0bff700 > | | | | | | > | | | | | |--20.00%-- 0x7f4acedff700 > | | | | | | > | | | | | |--10.00%-- 0x7f4ad2fff700 > | | | | | | > | | | | | |--10.00%-- 0x7f4ad17ff700 > | | | | | | > | | | | | |--10.00%-- 0x7f4ad11ff700 > | | | | | | > | | | | | |--10.00%-- 0x7f4ad05ff700 > | | | | | | > | | | | | |--10.00%-- 0x7f4acf3ff700 > | | | | | | > | | | | | --10.00%-- 0x7f4acdbff700 > | | | | | > | | | | --7.00%-- xfs_alloc_vextent > | | | | xfs_bmap_btalloc > | | | | xfs_bmap_alloc > | | | | xfs_bmapi_write > | | | | xfs_iomap_write_direct > | | | | __xfs_get_blocks > | | | | xfs_get_blocks_direct > | | | | do_blockdev_direct_IO > | | | | __blockdev_direct_IO > | | | | xfs_vm_direct_IO > | | | | xfs_file_dio_aio_write > | | | | xfs_file_write_iter > | | | | aio_run_iocb > | | | | do_io_submit > | | | | sys_io_submit > | | | | entry_SYSCALL_64_fastpath > | | | | io_submit > | | | | 0x46d98a > | | | | > | | | --1.96%-- xfs_read_agi > | | | xfs_iunlink_remove > | | | xfs_ifree > | | | xfs_inactive_ifree > | | | xfs_inactive > | | | 
xfs_fs_evict_inode > | | | evict > | | | iput > | | | __dentry_kill > | | | dput > | | | __fput > | | | ____fput > | | | task_work_run > | | | do_notify_resume > | | | int_signal > | | | __libc_close > | | | std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access > | | | > | | --2.86%-- xfs_trans_get_buf_map > | | xfs_btree_get_bufl > | | xfs_bmap_extents_to_btree > | | xfs_bmap_add_extent_hole_real > | | xfs_bmapi_write > | | xfs_iomap_write_direct > | | __xfs_get_blocks > | | xfs_get_blocks_direct > | | do_blockdev_direct_IO > | | __blockdev_direct_IO > | | xfs_vm_direct_IO > | | xfs_file_dio_aio_write > | | xfs_file_write_iter > | | aio_run_iocb > | | do_io_submit > | | sys_io_submit > | | entry_SYSCALL_64_fastpath > | | io_submit > | | 0x46d98a > | | > | |--13.48%-- eventfd_ctx_read > | | eventfd_read > | | __vfs_read > | | vfs_read > | | sys_read > | | entry_SYSCALL_64_fastpath > | | 0x7f4ade6f754d > | | smp_message_queue::respond > | | 0xffffffffffffffff > | | > | |--7.83%-- md_flush_request > | | raid0_make_request > | | md_make_request > | | generic_make_request > | | submit_bio > | | | > | | |--92.54%-- submit_bio_wait > | | | blkdev_issue_flush > | | | xfs_blkdev_issue_flush > | | | xfs_file_fsync > | | | vfs_fsync_range > | | | do_fsync > | | | sys_fdatasync > | | | entry_SYSCALL_64_fastpath > | | | | > | | | --100.00%-- 0x7f4ade4212ad > | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | 0x6010000c3ec0 > | | | > | | --7.46%-- _xfs_buf_ioapply > | | xfs_buf_submit > | | xlog_bdstrat > | | xlog_sync > | | xlog_state_release_iclog > | | | > | | |--73.33%-- _xfs_log_force_lsn > | | | xfs_file_fsync > | | | vfs_fsync_range > | | | do_fsync > | | | sys_fdatasync > | | | entry_SYSCALL_64_fastpath > | | | | > | | | --100.00%-- 0x7f4ade4212ad > | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | 
0x6080000c3ec0 > | | | > | | --26.67%-- _xfs_log_force > | | xfs_log_force > | | xfs_buf_lock > | | _xfs_buf_find > | | xfs_buf_get_map > | | xfs_trans_get_buf_map > | | xfs_btree_get_bufl > | | xfs_bmap_extents_to_btree > | | xfs_bmap_add_extent_hole_real > | | xfs_bmapi_write > | | xfs_iomap_write_direct > | | __xfs_get_blocks > | | xfs_get_blocks_direct > | | do_blockdev_direct_IO > | | __blockdev_direct_IO > | | xfs_vm_direct_IO > | | xfs_file_dio_aio_write > | | xfs_file_write_iter > | | aio_run_iocb > | | do_io_submit > | | sys_io_submit > | | entry_SYSCALL_64_fastpath > | | io_submit > | | 0x46d98a > | | > | |--5.53%-- _xfs_log_force_lsn > | | | > | | |--80.28%-- xfs_file_fsync > | | | vfs_fsync_range > | | | do_fsync > | | | sys_fdatasync > | | | entry_SYSCALL_64_fastpath > | | | | > | | | --100.00%-- 0x7f4ade4212ad > | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | | > | | | |--97.92%-- 0x60d0000c3ec0 > | | | | > | | | |--1.04%-- 0x6020000c3ec0 > | | | | > | | | --1.04%-- 0x600000557ec0 > | | | > | | --19.72%-- xfs_dir_fsync > | | vfs_fsync_range > | | do_fsync > | | sys_fdatasync > | | entry_SYSCALL_64_fastpath > | | | > | | --100.00%-- 0x7f4ade4212ad > | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | 0x6040000c3ec0 > | | > | |--1.25%-- rwsem_down_read_failed > | | call_rwsem_down_read_failed > | | | > | | |--90.62%-- xfs_ilock > | | | | > | | | |--86.21%-- xfs_ilock_data_map_shared > | | | | __xfs_get_blocks > | | | | xfs_get_blocks_direct > | | | | do_blockdev_direct_IO > | | | | __blockdev_direct_IO > | | | | xfs_vm_direct_IO > | | | | xfs_file_dio_aio_write > | | | | xfs_file_write_iter > | | | | aio_run_iocb > | | | | do_io_submit > | | | | sys_io_submit > | | | | entry_SYSCALL_64_fastpath > | | | | | > | | | | --100.00%-- io_submit > | | | | 0x46d98a > | | | | > | | | |--6.90%-- xfs_file_fsync > | | | | 
vfs_fsync_range > | | | | do_fsync > | | | | sys_fdatasync > | | | | entry_SYSCALL_64_fastpath > | | | | 0x7f4ade4212ad > | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | | 0x6090000c3ec0 > | | | | > | | | --6.90%-- xfs_dir_fsync > | | | vfs_fsync_range > | | | do_fsync > | | | sys_fdatasync > | | | entry_SYSCALL_64_fastpath > | | | 0x7f4ade4212ad > | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | 0x6070000c3ec0 > | | | > | | --9.38%-- xfs_log_commit_cil > | | __xfs_trans_commit > | | xfs_trans_commit > | | | > | | |--33.33%-- xfs_setattr_size > | | | xfs_vn_setattr > | | | notify_change > | | | do_truncate > | | | do_sys_ftruncate.constprop.15 > | | | sys_ftruncate > | | | entry_SYSCALL_64_fastpath > | | | __GI___ftruncate64 > | | | syscall_work_queue::work_item_returning<syscall_result_extra<stat>, posix_file_impl::stat()::{lambda()#1}>::process > | | | 0x7f4acedff700 > | | | > | | |--33.33%-- xfs_vn_update_time > | | | file_update_time > | | | xfs_file_aio_write_checks > | | | xfs_file_dio_aio_write > | | | xfs_file_write_iter > | | | aio_run_iocb > | | | do_io_submit > | | | sys_io_submit > | | | entry_SYSCALL_64_fastpath > | | | io_submit > | | | 0x46d98a > | | | > | | --33.33%-- xfs_bmap_add_attrfork > | | xfs_attr_set > | | xfs_initxattrs > | | security_inode_init_security > | | xfs_init_security > | | xfs_generic_create > | | xfs_vn_mknod > | | xfs_vn_create > | | vfs_create > | | path_openat > | | do_filp_open > | | do_sys_open > | | sys_open > | | entry_SYSCALL_64_fastpath > | | 0x7f4ade6f7cdd > | | syscall_work_queue::work_item_returning<syscall_result<int>, reactor::open_file_dma(basic_sstring<char, unsigned int, 15u>, open_flags, file_open_options)::{lambda()#1}>::process > | | 0xffffffffffffffff > | | > | |--0.97%-- rwsem_down_write_failed > | | call_rwsem_down_write_failed > | | xfs_ilock > | | 
xfs_vn_update_time > | | file_update_time > | | xfs_file_aio_write_checks > | | xfs_file_dio_aio_write > | | xfs_file_write_iter > | | aio_run_iocb > | | do_io_submit > | | sys_io_submit > | | entry_SYSCALL_64_fastpath > | | io_submit > | | 0x46d98a > | | > | |--0.51%-- xlog_cil_force_lsn > | | | > | | |--92.31%-- _xfs_log_force_lsn > | | | | > | | | |--91.67%-- xfs_file_fsync > | | | | vfs_fsync_range > | | | | do_fsync > | | | | sys_fdatasync > | | | | entry_SYSCALL_64_fastpath > | | | | 0x7f4ade4212ad > | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | | 0x60b0000c3ec0 > | | | | > | | | --8.33%-- xfs_dir_fsync > | | | vfs_fsync_range > | | | do_fsync > | | | sys_fdatasync > | | | entry_SYSCALL_64_fastpath > | | | 0x7f4ade4212ad > | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | 0x60d0000c3ec0 > | | | > | | --7.69%-- _xfs_log_force > | | xfs_log_force > | | xfs_buf_lock > | | _xfs_buf_find > | | xfs_buf_get_map > | | xfs_trans_get_buf_map > | | xfs_btree_get_bufl > | | xfs_bmap_extents_to_btree > | | xfs_bmap_add_extent_hole_real > | | xfs_bmapi_write > | | xfs_iomap_write_direct > | | __xfs_get_blocks > | | xfs_get_blocks_direct > | | do_blockdev_direct_IO > | | __blockdev_direct_IO > | | xfs_vm_direct_IO > | | xfs_file_dio_aio_write > | | xfs_file_write_iter > | | aio_run_iocb > | | do_io_submit > | | sys_io_submit > | | entry_SYSCALL_64_fastpath > | | io_submit > | | 0x46d98a > | --0.04%-- [...] 
> | > --3.82%-- preempt_schedule_common > | > |--99.02%-- _cond_resched > | | > | |--41.58%-- wait_for_completion > | | | > | | |--66.67%-- flush_work > | | | xlog_cil_force_lsn > | | | | > | | | |--96.43%-- _xfs_log_force_lsn > | | | | | > | | | | |--77.78%-- xfs_file_fsync > | | | | | vfs_fsync_range > | | | | | do_fsync > | | | | | sys_fdatasync > | | | | | entry_SYSCALL_64_fastpath > | | | | | 0x7f4ade4212ad > | | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | | | 0x6030000c3ec0 > | | | | | > | | | | --22.22%-- xfs_dir_fsync > | | | | vfs_fsync_range > | | | | do_fsync > | | | | sys_fdatasync > | | | | entry_SYSCALL_64_fastpath > | | | | | > | | | | --100.00%-- 0x7f4ade4212ad > | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | | 0x6030000c3ec0 > | | | | > | | | --3.57%-- _xfs_log_force > | | | xfs_log_force > | | | xfs_buf_lock > | | | _xfs_buf_find > | | | xfs_buf_get_map > | | | xfs_trans_get_buf_map > | | | xfs_btree_get_bufl > | | | xfs_bmap_extents_to_btree > | | | xfs_bmap_add_extent_hole_real > | | | xfs_bmapi_write > | | | xfs_iomap_write_direct > | | | __xfs_get_blocks > | | | xfs_get_blocks_direct > | | | do_blockdev_direct_IO > | | | __blockdev_direct_IO > | | | xfs_vm_direct_IO > | | | xfs_file_dio_aio_write > | | | xfs_file_write_iter > | | | aio_run_iocb > | | | do_io_submit > | | | sys_io_submit > | | | entry_SYSCALL_64_fastpath > | | | io_submit > | | | 0x46d98a > | | | > | | --33.33%-- submit_bio_wait > | | blkdev_issue_flush > | | xfs_blkdev_issue_flush > | | xfs_file_fsync > | | vfs_fsync_range > | | do_fsync > | | sys_fdatasync > | | entry_SYSCALL_64_fastpath > | | | > | | --100.00%-- 0x7f4ade4212ad > | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | 0x6030000c3ec0 > | | > | |--33.66%-- flush_work > | | xlog_cil_force_lsn 
> | | | > | | |--97.06%-- _xfs_log_force_lsn > | | | | > | | | |--78.79%-- xfs_file_fsync > | | | | vfs_fsync_range > | | | | do_fsync > | | | | sys_fdatasync > | | | | entry_SYSCALL_64_fastpath > | | | | 0x7f4ade4212ad > | | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | | 0x6030000c3ec0 > | | | | > | | | --21.21%-- xfs_dir_fsync > | | | vfs_fsync_range > | | | do_fsync > | | | sys_fdatasync > | | | entry_SYSCALL_64_fastpath > | | | | > | | | --100.00%-- 0x7f4ade4212ad > | | | syscall_work_queue::work_item_returning<syscall_result<int>, posix_file_impl::flush()::{lambda()#1}>::process > | | | 0x6030000c3ec0 > | | | > | | --2.94%-- _xfs_log_force > | | xfs_log_force > | | xfs_buf_lock > | | _xfs_buf_find > | | xfs_buf_get_map > | | xfs_trans_get_buf_map > | | xfs_btree_get_bufl > | | xfs_bmap_extents_to_btree > | | xfs_bmap_add_extent_hole_real > | | xfs_bmapi_write > | | xfs_iomap_write_direct > | | __xfs_get_blocks > | | xfs_get_blocks_direct > | | do_blockdev_direct_IO > | | __blockdev_direct_IO > | | xfs_vm_direct_IO > | | xfs_file_dio_aio_write > | | xfs_file_write_iter > | | aio_run_iocb > | | do_io_submit > | | sys_io_submit > | | entry_SYSCALL_64_fastpath > | | io_submit > | | 0x46d98a > | | > | |--13.86%-- lock_sock_nested > | | | > | | |--78.57%-- tcp_sendmsg > | | | inet_sendmsg > | | | sock_sendmsg > | | | SYSC_sendto > | | | sys_sendto > | | | entry_SYSCALL_64_fastpath > | | | __libc_send > | | | _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv > | | | | > | | | |--36.36%-- 0x7f4ad6bf8de0 > | | | | > | | | |--9.09%-- 0x4 > | | | | > | | | |--9.09%-- 0x7f4adadf8de0 > | | | | > | | | |--9.09%-- 0x7f4ada1f8de0 > | | | | > | | | |--9.09%-- 0x7f4ad89f8de0 > | | | | > | | | |--9.09%-- 0x7f4ad83f8de0 > | | | | > | | | |--9.09%-- 0x7f4ad4df8de0 > | | | | > | | | --9.09%-- 0x7f4ad35f8de0 > | | | > | | 
--21.43%-- tcp_recvmsg > | | inet_recvmsg > | | sock_recvmsg > | | sock_read_iter > | | __vfs_read > | | vfs_read > | | sys_read > | | entry_SYSCALL_64_fastpath > | | 0x7f4ade6f754d > | | reactor::read_some > | | | > | | |--66.67%-- _ZN12continuationIZN6futureIJEE4thenIZZN7service13storage_proxy22send_to_live_endpointsEmENKUlRSt4pairIK13basic_sstringIcjLj15EESt6vectorIN3gms12inet_addressESaISB_EEEE_clESF_EUlvE_S1_EET0_OT_EUlSK_E_JEE3runEv > | | | reactor::del_timer > | | | 0x6160000e2040 > | | | > | | --33.33%-- continuation<future<> future<>::then_wrapped<future<> future<>::finally<auto seastar::with_gate<transport::cql_server::connection::process()::{lambda()#2}::operator()() const::{lambda()#1}>(seastar::gate&, transport::cql_server::connection::process()::{lambda()#2}::operator()() const::{lambda()#1}&&)::{lambda()#1}>(seastar::gate&)::{lambda(future<>)#1}::operator()(future<>)::{lambda(seastar::gate)#1}, future<> >(seastar::gate&)::{lambda(seastar::gate&)#1}>::run > | | reactor::del_timer > | | 0x6030000e2040 > | | > | |--3.96%-- generic_make_request_checks > | | generic_make_request > | | submit_bio > | | do_blockdev_direct_IO > | | __blockdev_direct_IO > | | xfs_vm_direct_IO > | | xfs_file_dio_aio_write > | | xfs_file_write_iter > | | aio_run_iocb > | | do_io_submit > | | sys_io_submit > | | entry_SYSCALL_64_fastpath > | | io_submit > | | 0x46d98a > | | > | |--3.96%-- kmem_cache_alloc_node > | | __alloc_skb > | | sk_stream_alloc_skb > | | tcp_sendmsg > | | inet_sendmsg > | | sock_sendmsg > | | SYSC_sendto > | | sys_sendto > | | entry_SYSCALL_64_fastpath > | | __libc_send > | | _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv > | | | > | | |--25.00%-- 0x7f4ad9bf8de0 > | | | > | | |--25.00%-- 0x7f4ad7df8de0 > | | | > | | |--25.00%-- 0x7f4ad77f8de0 > | | | > | | --25.00%-- 0x7f4ad59f8de0 > | | > | |--0.99%-- unmap_underlying_metadata > | | do_blockdev_direct_IO > | | 
__blockdev_direct_IO > | | xfs_vm_direct_IO > | | xfs_file_dio_aio_write > | | xfs_file_write_iter > | | aio_run_iocb > | | do_io_submit > | | sys_io_submit > | | entry_SYSCALL_64_fastpath > | | io_submit > | | 0x46d98a > | | > | |--0.99%-- __kmalloc_node_track_caller > | | __kmalloc_reserve.isra.32 > | | __alloc_skb > | | sk_stream_alloc_skb > | | tcp_sendmsg > | | inet_sendmsg > | | sock_sendmsg > | | SYSC_sendto > | | sys_sendto > | | entry_SYSCALL_64_fastpath > | | __libc_send > | | _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv > | | 0x7f4ad6bf8de0 > | | > | --0.99%-- task_work_run > | do_notify_resume > | int_signal > | __libc_close > | std::experimental::fundamentals_v1::bad_optional_access::~bad_optional_access > | > --0.98%-- __cond_resched_softirq > release_sock > tcp_sendmsg > inet_sendmsg > sock_sendmsg > SYSC_sendto > sys_sendto > entry_SYSCALL_64_fastpath > __libc_send > _ZN12continuationIZN6futureIJmEE4thenIZN7reactor14write_all_partER17pollable_fd_statePKvmmEUlmE_S0_IJEEEET0_OT_EUlSC_E_JmEE3runEv > 0x7f4ada1f8de0 > > > > # > # (For a higher level overview, try: perf report --sort comm,dso) > # > [164814.835933] CPU: 22 PID: 48042 Comm: scylla Tainted: G E 4.2.6-200.fc22.x86_64 #1 > [164814.835936] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/06/2015 > [164814.835937] 0000000000000000 00000000a8713b7a ffff8802fb977ab8 ffffffff817729ea > [164814.835941] 0000000000000000 ffff88076a69f780 ffff8802fb977ad8 ffffffffa03217a6 > [164814.835946] ffff88077119bcb0 0000000000000000 ffff8802fb977b08 ffffffffa034e749 > [164814.835951] Call Trace: > [164814.835954] [<ffffffff817729ea>] dump_stack+0x45/0x57 > [164814.835971] [<ffffffffa03217a6>] xfs_buf_stale+0x26/0x80 [xfs] > [164814.835989] [<ffffffffa034e749>] xfs_trans_binval+0x79/0x100 [xfs] > [164814.836001] [<ffffffffa02f479b>] xfs_bmap_btree_to_extents+0x12b/0x1a0 [xfs] > [164814.836012] [<ffffffffa02f8977>] 
xfs_bunmapi+0x967/0x9f0 [xfs] > [164814.836027] [<ffffffffa0334b9e>] xfs_itruncate_extents+0x10e/0x220 [xfs] > [164814.836044] [<ffffffffa033f75a>] ? kmem_zone_alloc+0x5a/0xe0 [xfs] > [164814.836084] [<ffffffffa0334d49>] xfs_inactive_truncate+0x99/0x110 [xfs] > [164814.836120] [<ffffffffa0335aa2>] xfs_inactive+0x102/0x120 [xfs] > [164814.836135] [<ffffffffa033a6cf>] xfs_fs_evict_inode+0x6f/0xa0 [xfs] > [164814.836138] [<ffffffff81238d76>] evict+0xa6/0x170 > [164814.836140] [<ffffffff81239026>] iput+0x196/0x220 > [164814.836147] [<ffffffff81234fe4>] __dentry_kill+0x174/0x1c0 > [164814.836150] [<ffffffff8123514b>] dput+0x11b/0x200 > [164814.836155] [<ffffffff8121fe02>] __fput+0x172/0x1e0 > [164814.836158] [<ffffffff8121febe>] ____fput+0xe/0x10 > [164814.836161] [<ffffffff810bab75>] task_work_run+0x85/0xb0 > [164814.836164] [<ffffffff81014a4d>] do_notify_resume+0x8d/0x90 > [164814.836167] [<ffffffff817795bc>] int_signal+0x12/0x17 > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
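For readers less familiar with the interface under discussion: the submission path here is the raw Linux AIO ABI (io_setup/io_submit/io_getevents), not POSIX aio. The sketch below drives that ABI from Python via ctypes purely as an illustration. Two assumptions to flag: the syscall numbers are the x86_64 ones, and the struct layouts mirror <linux/aio_abi.h> on a little-endian machine. A latency-sensitive caller like the one described above would also open with O_DIRECT and use aligned buffers; this sketch performs a plain buffered write, which the kernel completes at submit time.

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)

# x86_64 Linux syscall numbers (assumption: amd64; other arches differ)
SYS_io_setup, SYS_io_destroy, SYS_io_getevents, SYS_io_submit = 206, 207, 208, 209

IOCB_CMD_PWRITE = 1


class iocb(ctypes.Structure):
    # Mirrors struct iocb from <linux/aio_abi.h> (little-endian layout)
    _fields_ = [("aio_data", ctypes.c_uint64),
                ("aio_key", ctypes.c_uint32),
                ("aio_rw_flags", ctypes.c_uint32),
                ("aio_lio_opcode", ctypes.c_uint16),
                ("aio_reqprio", ctypes.c_int16),
                ("aio_fildes", ctypes.c_uint32),
                ("aio_buf", ctypes.c_uint64),
                ("aio_nbytes", ctypes.c_uint64),
                ("aio_offset", ctypes.c_int64),
                ("aio_reserved2", ctypes.c_uint64),
                ("aio_flags", ctypes.c_uint32),
                ("aio_resfd", ctypes.c_uint32)]


class io_event(ctypes.Structure):
    # Mirrors struct io_event from <linux/aio_abi.h>
    _fields_ = [("data", ctypes.c_uint64), ("obj", ctypes.c_uint64),
                ("res", ctypes.c_int64), ("res2", ctypes.c_int64)]


def aio_write_once(path, payload):
    """Submit one write with io_submit() and reap it with io_getevents().

    Returns the completion's res field (bytes written, or negative errno).
    """
    ctx = ctypes.c_ulong(0)
    if libc.syscall(SYS_io_setup, 8, ctypes.byref(ctx)) < 0:
        raise OSError(ctypes.get_errno(), "io_setup failed")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    buf = ctypes.create_string_buffer(payload)
    cb = iocb(aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=fd,
              aio_buf=ctypes.addressof(buf), aio_nbytes=len(payload),
              aio_offset=0)
    cbp = ctypes.pointer(cb)
    if libc.syscall(SYS_io_submit, ctx, 1, ctypes.byref(cbp)) != 1:
        raise OSError(ctypes.get_errno(), "io_submit failed")
    ev = io_event()
    if libc.syscall(SYS_io_getevents, ctx, 1, 1, ctypes.byref(ev), None) != 1:
        raise OSError(ctypes.get_errno(), "io_getevents failed")
    libc.syscall(SYS_io_destroy, ctx)
    os.close(fd)
    return ev.res
```

With O_DIRECT the submission would return immediately and the completion would surface later through io_getevents, which is exactly why any blocking inside io_submit() itself is so visible to a thread-per-core design.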
* Re: sleeps and waits during io_submit 2015-11-30 14:10 ` Brian Foster @ 2015-11-30 14:29 ` Avi Kivity 2015-11-30 16:14 ` Brian Foster 2015-11-30 15:49 ` Glauber Costa 1 sibling, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-11-30 14:29 UTC (permalink / raw) To: Brian Foster, Glauber Costa; +Cc: xfs On 11/30/2015 04:10 PM, Brian Foster wrote: >> 2) xfs_buf_lock -> down >> This is one I truly don't understand. What can be causing contention >> in this lock? We never have two different cores writing to the same >> buffer, nor should we have the same core doing so. >> > This is not one single lock. An XFS buffer is the data structure used to > modify/log/read-write metadata on-disk and each buffer has its own lock > to prevent corruption. Buffer lock contention is possible because the > filesystem has bits of "global" metadata that has to be updated via > buffers. > > For example, usually one has multiple allocation groups to maximize > parallelism, but we still have per-ag metadata that has to be tracked > globally with respect to each AG (e.g., free space trees, inode > allocation trees, etc.). Any operation that affects this metadata (e.g., > block/inode allocation) has to lock the agi/agf buffers along with any > buffers associated with the modified btree leaf/node blocks, etc. > > One example in your attached perf traces has several threads looking to > acquire the AGF, which is a per-AG data structure for tracking free > space in the AG. One thread looks like the inode eviction case noted > above (freeing blocks), another looks like a file truncate (also freeing > blocks), and yet another is a block allocation due to a direct I/O > write. Were any of these operations directed to an inode in a separate > AG, they would be able to proceed in parallel (but I believe they would > still hit the same codepaths as far as perf can tell). I guess we can mitigate (but not eliminate) this by creating more allocation groups. 
What is the default value for agsize? Are there any downsides to decreasing it, besides consuming more memory? Are those locks held around I/O, or just CPU operations, or a mix?
* Re: sleeps and waits during io_submit 2015-11-30 14:29 ` Avi Kivity @ 2015-11-30 16:14 ` Brian Foster 2015-12-01 9:08 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Brian Foster @ 2015-11-30 16:14 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: > > > On 11/30/2015 04:10 PM, Brian Foster wrote: > >>2) xfs_buf_lock -> down > >>This is one I truly don't understand. What can be causing contention > >>in this lock? We never have two different cores writing to the same > >>buffer, nor should we have the same core doing so. > >> > >This is not one single lock. An XFS buffer is the data structure used to > >modify/log/read-write metadata on-disk and each buffer has its own lock > >to prevent corruption. Buffer lock contention is possible because the > >filesystem has bits of "global" metadata that has to be updated via > >buffers. > > > >For example, usually one has multiple allocation groups to maximize > >parallelism, but we still have per-ag metadata that has to be tracked > >globally with respect to each AG (e.g., free space trees, inode > >allocation trees, etc.). Any operation that affects this metadata (e.g., > >block/inode allocation) has to lock the agi/agf buffers along with any > >buffers associated with the modified btree leaf/node blocks, etc. > > > >One example in your attached perf traces has several threads looking to > >acquire the AGF, which is a per-AG data structure for tracking free > >space in the AG. One thread looks like the inode eviction case noted > >above (freeing blocks), another looks like a file truncate (also freeing > >blocks), and yet another is a block allocation due to a direct I/O > >write. 
> > I guess we can mitigate (but not eliminate) this by creating more allocation > groups. What is the default value for agsize? Are there any downsides to > decreasing it, besides consuming more memory? > I suppose so, but I would be careful to check that you actually see contention and test that increasing agcount actually helps. As mentioned, I'm not sure off hand if the perf trace alone would look any different if you have multiple metadata operations in progress on separate AGs. My understanding is that there are diminishing returns to high AG counts and usually 32-64 is sufficient for most storage. Dave might be able to elaborate more on that... (I think this would make a good FAQ entry, actually). The agsize/agcount mkfs-time heuristics change depending on the type of storage. A single AG can be up to 1TB and if the fs is not considered "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the default up to 4TB. If a stripe unit is set, the agsize/agcount is adjusted depending on the size of the overall volume (see xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). > Are those locks held around I/O, or just CPU operations, or a mix? I believe it's a mix of modifications and I/O, though it looks like some of the I/O cases don't necessarily wait on the lock. E.g., the AIL pushing case will trylock and defer to the next list iteration if the buffer is busy. Brian _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
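The default geometry Brian describes can be condensed into a few lines for quick estimation. This is a simplified paraphrase of the behavior described above, not the actual calc_default_ag_geometry() code, which also handles stripe geometry and other cases:

```python
TB = 1 << 40

def default_agcount_nomultidisk(fs_bytes):
    """Estimate mkfs.xfs's default AG count for non-multidisk storage,
    per the description above: 4 AGs up to 4TB, and beyond that enough
    AGs to keep each one at or under the 1TB maximum.

    Simplified sketch only; the real heuristics live in
    xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry().
    """
    if fs_bytes <= 4 * TB:
        return 4
    # ceiling division: cap AG size at 1TB
    return -(-fs_bytes // TB)
```

By this rule, the 400G filesystem mentioned elsewhere in the thread would get the default of 4 AGs.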
* Re: sleeps and waits during io_submit 2015-11-30 16:14 ` Brian Foster @ 2015-12-01 9:08 ` Avi Kivity 2015-12-01 13:11 ` Brian Foster 0 siblings, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-01 9:08 UTC (permalink / raw) To: Brian Foster; +Cc: Glauber Costa, xfs On 11/30/2015 06:14 PM, Brian Foster wrote: > On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: >> >> On 11/30/2015 04:10 PM, Brian Foster wrote: >>>> 2) xfs_buf_lock -> down >>>> This is one I truly don't understand. What can be causing contention >>>> in this lock? We never have two different cores writing to the same >>>> buffer, nor should we have the same core doing so. >>>> >>> This is not one single lock. An XFS buffer is the data structure used to >>> modify/log/read-write metadata on-disk and each buffer has its own lock >>> to prevent corruption. Buffer lock contention is possible because the >>> filesystem has bits of "global" metadata that has to be updated via >>> buffers. >>> >>> For example, usually one has multiple allocation groups to maximize >>> parallelism, but we still have per-ag metadata that has to be tracked >>> globally with respect to each AG (e.g., free space trees, inode >>> allocation trees, etc.). Any operation that affects this metadata (e.g., >>> block/inode allocation) has to lock the agi/agf buffers along with any >>> buffers associated with the modified btree leaf/node blocks, etc. >>> >>> One example in your attached perf traces has several threads looking to >>> acquire the AGF, which is a per-AG data structure for tracking free >>> space in the AG. One thread looks like the inode eviction case noted >>> above (freeing blocks), another looks like a file truncate (also freeing >>> blocks), and yet another is a block allocation due to a direct I/O >>> write. 
Were any of these operations directed to an inode in a separate >>> AG, they would be able to proceed in parallel (but I believe they would >>> still hit the same codepaths as far as perf can tell). >> I guess we can mitigate (but not eliminate) this by creating more allocation >> groups. What is the default value for agsize? Are there any downsides to >> decreasing it, besides consuming more memory? >> > I suppose so, but I would be careful to check that you actually see > contention and test that increasing agcount actually helps. As > mentioned, I'm not sure off hand if the perf trace alone would look any > different if you have multiple metadata operations in progress on > separate AGs. > > My understanding is that there are diminishing returns to high AG counts > and usually 32-64 is sufficient for most storage. Dave might be able to > elaborate more on that... (I think this would make a good FAQ entry, > actually). > > The agsize/agcount mkfs-time heuristics change depending on the type of > storage. A single AG can be up to 1TB and if the fs is not considered > "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the > default up to 4TB. If a stripe unit is set, the agsize/agcount is > adjusted depending on the size of the overall volume (see > xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). We'll experiment with this. Surely it depends on more than the amount of storage? If you have a high op rate you'll be more likely to excite contention, no? > >> Are those locks held around I/O, or just CPU operations, or a mix? > I believe it's a mix of modifications and I/O, though it looks like some > of the I/O cases don't necessarily wait on the lock. E.g., the AIL > pushing case will trylock and defer to the next list iteration if the > buffer is busy. > Ok. For us sleeping in io_submit() is death because we have no other thread on that core to take its place. 
* Re: sleeps and waits during io_submit 2015-12-01 9:08 ` Avi Kivity @ 2015-12-01 13:11 ` Brian Foster 2015-12-01 13:58 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Brian Foster @ 2015-12-01 13:11 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: > On 11/30/2015 06:14 PM, Brian Foster wrote: > >On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: > >> > >>On 11/30/2015 04:10 PM, Brian Foster wrote: ... > >The agsize/agcount mkfs-time heuristics change depending on the type of > >storage. A single AG can be up to 1TB and if the fs is not considered > >"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the > >default up to 4TB. If a stripe unit is set, the agsize/agcount is > >adjusted depending on the size of the overall volume (see > >xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). > > We'll experiment with this. Surely it depends on more than the amount of > storage? If you have a high op rate you'll be more likely to excite > contention, no? > Sure. The absolute optimal configuration for your workload probably depends on more than storage size, but mkfs doesn't have that information. In general, it tries to use the most reasonable configuration based on the storage and expected workload. If you want to tweak it beyond that, indeed, the best bet is to experiment with what works. > > > >>Are those locks held around I/O, or just CPU operations, or a mix? > >I believe it's a mix of modifications and I/O, though it looks like some > >of the I/O cases don't necessarily wait on the lock. E.g., the AIL > >pushing case will trylock and defer to the next list iteration if the > >buffer is busy. > > > > Ok. For us sleeping in io_submit() is death because we have no other thread > on that core to take its place. > The above is with regard to metadata I/O, whereas io_submit() is obviously for user I/O. 
io_submit() can probably block in a variety of places afaict... it might have to read in the inode extent map, allocate blocks, take inode/ag locks, reserve log space for transactions, etc. It sounds to me that first and foremost you want to make sure you don't have however many parallel operations you typically have running contending on the same inodes or AGs. Hint: creating files under separate subdirectories is a quick and easy way to allocate inodes under separate AGs (the agno is encoded into the upper bits of the inode number). Reducing the frequency of block allocation/frees might also be another help (e.g., preallocate and reuse files, 'mount -o ikeep,' etc.). Beyond that, you probably want to make sure the log is large enough to support all concurrent operations. See the xfs_log_grant_* tracepoints for a window into if/how long transaction reservations might be waiting on the log. Brian > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
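The hint above, that the agno is encoded in the upper bits of the inode number, can be made concrete. In the XFS on-disk format, an absolute inode number packs the AG number, the AG-relative block number, and the inode's offset within that block, so the AG is recovered with a single shift. The shift widths come from the superblock fields sb_agblklog and sb_inopblog (visible via xfs_db); the widths used in the sketch below are illustrative values, not taken from any particular filesystem:

```python
def xfs_ino_to_agno(ino, agblklog, inopblog):
    """Return the allocation group number encoded in an XFS inode number.

    On-disk layout: ino = (agno << (agblklog + inopblog))
                        | (agbno << inopblog)
                        | offset_in_block
    where agblklog is log2 of blocks per AG (rounded up) and inopblog is
    log2 of inodes per block; both are superblock fields.
    """
    return ino >> (agblklog + inopblog)
```

Files whose inode numbers decode to different AGs can allocate blocks in parallel, which is why placing files under separate subdirectories (and therefore separate AGs) helps.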
* Re: sleeps and waits during io_submit 2015-12-01 13:11 ` Brian Foster @ 2015-12-01 13:58 ` Avi Kivity 2015-12-01 14:01 ` Glauber Costa 2015-12-01 14:56 ` Brian Foster 0 siblings, 2 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-01 13:58 UTC (permalink / raw) To: Brian Foster; +Cc: Glauber Costa, xfs On 12/01/2015 03:11 PM, Brian Foster wrote: > On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: >> On 11/30/2015 06:14 PM, Brian Foster wrote: >>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: >>>> On 11/30/2015 04:10 PM, Brian Foster wrote: > ... >>> The agsize/agcount mkfs-time heuristics change depending on the type of >>> storage. A single AG can be up to 1TB and if the fs is not considered >>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the >>> default up to 4TB. If a stripe unit is set, the agsize/agcount is >>> adjusted depending on the size of the overall volume (see >>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). >> We'll experiment with this. Surely it depends on more than the amount of >> storage? If you have a high op rate you'll be more likely to excite >> contention, no? >> > Sure. The absolute optimal configuration for your workload probably > depends on more than storage size, but mkfs doesn't have that > information. In general, it tries to use the most reasonable > configuration based on the storage and expected workload. If you want to > tweak it beyond that, indeed, the best bet is to experiment with what > works. We will do that. >>>> Are those locks held around I/O, or just CPU operations, or a mix? >>> I believe it's a mix of modifications and I/O, though it looks like some >>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL >>> pushing case will trylock and defer to the next list iteration if the >>> buffer is busy. >>> >> Ok. For us sleeping in io_submit() is death because we have no other thread >> on that core to take its place. 
>> > The above is with regard to metadata I/O, whereas io_submit() is > obviously for user I/O. Won't io_submit() also trigger metadata I/O? Or is that all deferred to async tasks? I don't mind them blocking each other as long as they let my io_submit alone. > io_submit() can probably block in a variety of > places afaict... it might have to read in the inode extent map, allocate > blocks, take inode/ag locks, reserve log space for transactions, etc. Any chance of changing all that to be asynchronous? Doesn't sound too hard, if somebody else has to do it. > > It sounds to me that first and foremost you want to make sure you don't > have however many parallel operations you typically have running > contending on the same inodes or AGs. Hint: creating files under > separate subdirectories is a quick and easy way to allocate inodes under > separate AGs (the agno is encoded into the upper bits of the inode > number). Unfortunately our directory layout cannot be changed. And doesn't this require having agcount == O(number of active files)? That is easily in the thousands. > Reducing the frequency of block allocation/frees might also be > another help (e.g., preallocate and reuse files, Isn't that discouraged for SSDs? We can do that for a subset of our files. We do use XFS_IOC_FSSETXATTR though. > 'mount -o ikeep,' Interesting. Our files are large so we could try this. > etc.). Beyond that, you probably want to make sure the log is large > enough to support all concurrent operations. See the xfs_log_grant_* > tracepoints for a window into if/how long transaction reservations might > be waiting on the log. I see that on a 400G fs, the log is 180MB. Seems plenty large for write operations that are mostly large sequential, though I've no real feel for the numbers. Will keep an eye on this. Thanks for all the info. 
> Brian
* Re: sleeps and waits during io_submit 2015-12-01 13:58 ` Avi Kivity @ 2015-12-01 14:01 ` Glauber Costa 2015-12-01 14:37 ` Avi Kivity 2015-12-01 20:45 ` Dave Chinner 2015-12-01 14:56 ` Brian Foster 1 sibling, 2 replies; 58+ messages in thread From: Glauber Costa @ 2015-12-01 14:01 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, xfs On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@scylladb.com> wrote: > > > On 12/01/2015 03:11 PM, Brian Foster wrote: >> >> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: >>> >>> On 11/30/2015 06:14 PM, Brian Foster wrote: >>>> >>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: >>>>> >>>>> On 11/30/2015 04:10 PM, Brian Foster wrote: >> >> ... >>>> >>>> The agsize/agcount mkfs-time heuristics change depending on the type of >>>> storage. A single AG can be up to 1TB and if the fs is not considered >>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the >>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is >>>> adjusted depending on the size of the overall volume (see >>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). >>> >>> We'll experiment with this. Surely it depends on more than the amount of >>> storage? If you have a high op rate you'll be more likely to excite >>> contention, no? >>> >> Sure. The absolute optimal configuration for your workload probably >> depends on more than storage size, but mkfs doesn't have that >> information. In general, it tries to use the most reasonable >> configuration based on the storage and expected workload. If you want to >> tweak it beyond that, indeed, the best bet is to experiment with what >> works. > > > We will do that. > >>>>> Are those locks held around I/O, or just CPU operations, or a mix? >>>> >>>> I believe it's a mix of modifications and I/O, though it looks like some >>>> of the I/O cases don't necessarily wait on the lock. 
E.g., the AIL >>>> pushing case will trylock and defer to the next list iteration if the >>>> buffer is busy. >>>> >>> Ok. For us sleeping in io_submit() is death because we have no other >>> thread >>> on that core to take its place. >>> >> The above is with regard to metadata I/O, whereas io_submit() is >> obviously for user I/O. > > > Won't io_submit() also trigger metadata I/O? Or is that all deferred to > async tasks? I don't mind them blocking each other as long as they let my > io_submit alone. > >> io_submit() can probably block in a variety of >> places afaict... it might have to read in the inode extent map, allocate >> blocks, take inode/ag locks, reserve log space for transactions, etc. > > > Any chance of changing all that to be asynchronous? Doesn't sound too hard, > if somebody else has to do it. > >> >> It sounds to me that first and foremost you want to make sure you don't >> have however many parallel operations you typically have running >> contending on the same inodes or AGs. Hint: creating files under >> separate subdirectories is a quick and easy way to allocate inodes under >> separate AGs (the agno is encoded into the upper bits of the inode >> number). > > > Unfortunately our directory layout cannot be changed. And doesn't this > require having agcount == O(number of active files)? That is easily in the > thousands. Actually, wouldn't agcount == O(nr_cpus) be good enough? > >> Reducing the frequency of block allocation/frees might also be >> another help (e.g., preallocate and reuse files, > > > Isn't that discouraged for SSDs? > > We can do that for a subset of our files. > > We do use XFS_IOC_FSSETXATTR though. > >> 'mount -o ikeep,' > > > Interesting. Our files are large so we could try this. > >> etc.). Beyond that, you probably want to make sure the log is large >> enough to support all concurrent operations. See the xfs_log_grant_* >> tracepoints for a window into if/how long transaction reservations might >> be waiting on the log. 
> > > I see that on an 400G fs, the log is 180MB. Seems plenty large for write > operations that are mostly large sequential, though I've no real feel for > the numbers. Will keep an eye on this. > > Thanks for all the info. > > >> Brian >> >>> _______________________________________________ >>> xfs mailing list >>> xfs@oss.sgi.com >>> http://oss.sgi.com/mailman/listinfo/xfs > > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 14:01 ` Glauber Costa @ 2015-12-01 14:37 ` Avi Kivity 2015-12-01 20:45 ` Dave Chinner 1 sibling, 0 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-01 14:37 UTC (permalink / raw) To: Glauber Costa; +Cc: Brian Foster, xfs On 12/01/2015 04:01 PM, Glauber Costa wrote: > On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@scylladb.com> wrote: >> >> On 12/01/2015 03:11 PM, Brian Foster wrote: >>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: >>>> On 11/30/2015 06:14 PM, Brian Foster wrote: >>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: >>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote: >>> ... >>>>> The agsize/agcount mkfs-time heuristics change depending on the type of >>>>> storage. A single AG can be up to 1TB and if the fs is not considered >>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the >>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is >>>>> adjusted depending on the size of the overall volume (see >>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). >>>> We'll experiment with this. Surely it depends on more than the amount of >>>> storage? If you have a high op rate you'll be more likely to excite >>>> contention, no? >>>> >>> Sure. The absolute optimal configuration for your workload probably >>> depends on more than storage size, but mkfs doesn't have that >>> information. In general, it tries to use the most reasonable >>> configuration based on the storage and expected workload. If you want to >>> tweak it beyond that, indeed, the best bet is to experiment with what >>> works. >> >> We will do that. >> >>>>>> Are those locks held around I/O, or just CPU operations, or a mix? >>>>> I believe it's a mix of modifications and I/O, though it looks like some >>>>> of the I/O cases don't necessarily wait on the lock. 
E.g., the AIL >>>>> pushing case will trylock and defer to the next list iteration if the >>>>> buffer is busy. >>>>> >>>> Ok. For us sleeping in io_submit() is death because we have no other >>>> thread >>>> on that core to take its place. >>>> >>> The above is with regard to metadata I/O, whereas io_submit() is >>> obviously for user I/O. >> >> Won't io_submit() also trigger metadata I/O? Or is that all deferred to >> async tasks? I don't mind them blocking each other as long as they let my >> io_submit alone. >> >>> io_submit() can probably block in a variety of >>> places afaict... it might have to read in the inode extent map, allocate >>> blocks, take inode/ag locks, reserve log space for transactions, etc. >> >> Any chance of changing all that to be asynchronous? Doesn't sound too hard, >> if somebody else has to do it. >> >>> It sounds to me that first and foremost you want to make sure you don't >>> have however many parallel operations you typically have running >>> contending on the same inodes or AGs. Hint: creating files under >>> separate subdirectories is a quick and easy way to allocate inodes under >>> separate AGs (the agno is encoded into the upper bits of the inode >>> number). >> >> Unfortunately our directory layout cannot be changed. And doesn't this >> require having agcount == O(number of active files)? That is easily in the >> thousands. > Actually, wouldn't agcount == O(nr_cpus) be good enough? Depends on whether the locks are around I/O or cpu access only. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 14:01 ` Glauber Costa 2015-12-01 14:37 ` Avi Kivity @ 2015-12-01 20:45 ` Dave Chinner 2015-12-01 20:56 ` Avi Kivity 1 sibling, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-12-01 20:45 UTC (permalink / raw) To: Glauber Costa; +Cc: Avi Kivity, Brian Foster, xfs On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote: > On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@scylladb.com> wrote: > > On 12/01/2015 03:11 PM, Brian Foster wrote: > >> It sounds to me that first and foremost you want to make sure you don't > >> have however many parallel operations you typically have running > >> contending on the same inodes or AGs. Hint: creating files under > >> separate subdirectories is a quick and easy way to allocate inodes under > >> separate AGs (the agno is encoded into the upper bits of the inode > >> number). > > > > > > Unfortunately our directory layout cannot be changed. And doesn't this > > require having agcount == O(number of active files)? That is easily in the > > thousands. > > Actually, wouldn't agcount == O(nr_cpus) be good enough? Not quite. What you need is agcount ~= O(nr_active_allocations). The difference is an allocation can block waiting on IO, and the CPU can then go off and run another process, which then tries to do an allocation. So you might only have 4 CPUs, but a workload that can have a hundred active allocations at once (not uncommon in file server workloads). On workloads that are roughly 1 process per CPU, it's typical that agcount = 2 * N cpus gives pretty good results on large filesystems. If you've got 400GB filesystems or you are using spinning disks, then you probably don't want to go above 16 AGs, because then you have problems with maintaining contiguous free space and you'll seek the spinning disks to death.... > >> 'mount -o ikeep,' > > > > > > Interesting. Our files are large so we could try this. 
Keep in mind that ikeep means that inode allocation permanently fragments free space, which can affect how large files are allocated once you truncate/rm the original files. Cheers, Dave. -- Dave Chinner david@fromorbit.com
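Dave's rule of thumb (agcount ≈ 2 × CPUs on large filesystems, capped around 16 AGs for ~400GB filesystems or spinning disks) can be sketched as a helper. This is a paraphrase of the advice in this message, not mkfs.xfs's actual geometry code; the function name and thresholds are illustrative only:

```python
def suggest_agcount(nr_cpus: int, fs_size_gb: int, rotational: bool) -> int:
    """Rough paraphrase of the guidance in this thread, not mkfs.xfs logic."""
    agcount = 2 * nr_cpus
    # Small filesystems (a few hundred GB) and spinning disks suffer from
    # fragmented free space / seek storms when chopped into too many AGs.
    if fs_size_gb <= 400 or rotational:
        agcount = min(agcount, 16)
    return max(agcount, 4)  # mkfs defaults to 4 AGs on single-disk layouts

print(suggest_agcount(nr_cpus=8, fs_size_gb=2000, rotational=False))
print(suggest_agcount(nr_cpus=32, fs_size_gb=400, rotational=False))
```

The result would then be applied at mkfs time, e.g. `mkfs.xfs -d agcount=16 <device>`; AG count cannot be changed after the fact.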
* Re: sleeps and waits during io_submit 2015-12-01 20:45 ` Dave Chinner @ 2015-12-01 20:56 ` Avi Kivity 2015-12-01 23:41 ` Dave Chinner 0 siblings, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-01 20:56 UTC (permalink / raw) To: Dave Chinner, Glauber Costa; +Cc: Brian Foster, xfs On 12/01/2015 10:45 PM, Dave Chinner wrote: > On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote: >> On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@scylladb.com> wrote: >>> On 12/01/2015 03:11 PM, Brian Foster wrote: >>>> It sounds to me that first and foremost you want to make sure you don't >>>> have however many parallel operations you typically have running >>>> contending on the same inodes or AGs. Hint: creating files under >>>> separate subdirectories is a quick and easy way to allocate inodes under >>>> separate AGs (the agno is encoded into the upper bits of the inode >>>> number). >>> >>> Unfortunately our directory layout cannot be changed. And doesn't this >>> require having agcount == O(number of active files)? That is easily in the >>> thousands. >> Actually, wouldn't agcount == O(nr_cpus) be good enough? > Not quite. What you need is agcount ~= O(nr_active_allocations). Yes, this is what I mean by "active files". > > The difference is an allocation can block waiting on IO, and the > CPU can then go off and run another process, which then tries to do > an allocation. So you might only have 4 CPUs, but a workload that > can have a hundred active allocations at once (not uncommon in > file server workloads). But for us, probably not much more. We try to restrict active I/Os to the effective disk queue depth (more than that and they just turn sour waiting in the disk queue). > On worklaods that are roughly 1 process per CPU, it's typical that > agcount = 2 * N cpus gives pretty good results on large filesystems. This is probably using sync calls. Using async calls you can have many more I/Os in progress (but still limited by effective disk queue depth). 
> If you've got 400GB filesystems or you are using spinning disks, > then you probably don't want to go above 16 AGs, because then you > have problems with maintaining contiguous free space and you'll > seek the spinning disks to death.... We're concentrating on SSDs for now. > >>>> 'mount -o ikeep,' >>> >>> Interesting. Our files are large so we could try this. > Keep in mind that ikeep means that inode allocation permanently > fragments free space, which can affect how large files are allocated > once you truncate/rm the original files. > > We can try to prime this by allocating a lot of inodes up front, then removing them, so that this doesn't happen. Hurray ext2.
* Re: sleeps and waits during io_submit 2015-12-01 20:56 ` Avi Kivity @ 2015-12-01 23:41 ` Dave Chinner 2015-12-02 8:23 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-12-01 23:41 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs On Tue, Dec 01, 2015 at 10:56:01PM +0200, Avi Kivity wrote: > On 12/01/2015 10:45 PM, Dave Chinner wrote: > >On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote: > >The difference is an allocation can block waiting on IO, and the > >CPU can then go off and run another process, which then tries to do > >an allocation. So you might only have 4 CPUs, but a workload that > >can have a hundred active allocations at once (not uncommon in > >file server workloads). > > But for us, probably not much more. We try to restrict active I/Os > to the effective disk queue depth (more than that and they just turn > sour waiting in the disk queue). > > > >On workloads that are roughly 1 process per CPU, it's typical that > >agcount = 2 * N cpus gives pretty good results on large filesystems. > > This is probably using sync calls. Using async calls you can have > many more I/Os in progress (but still limited by effective disk > queue depth). Ah, no. Even with async IO you don't want unbound allocation concurrency. The allocation algorithms rely on having contiguous free space extents that are much larger than the allocations being done to work effectively and minimise file fragmentation. If you chop the filesystem up into lots of small AGs, then it accelerates the rate at which the free space gets chopped up into smaller extents and performance then suffers. It's the same problem as running a large filesystem near ENOSPC for an extended period of time, which again is something we most definitely don't recommend you do in production systems.
> >If you've got 400GB filesystems or you are using spinning disks, > >then you probably don't want to go above 16 AGs, because then you > >have problems with maintaining contiguous free space and you'll > >seek the spinning disks to death.... > > We're concentrating on SSDs for now. Sure, so "problems with maintaining contiguous free space" is what you need to be concerned about. > >>>>'mount -o ikeep,' > >>> > >>>Interesting. Our files are large so we could try this. > >Keep in mind that ikeep means that inode allocation permanently > >fragments free space, which can affect how large files are allocated > >once you truncate/rm the original files. > > We can try to prime this by allocating a lot of inodes up front, > then removing them, so that this doesn't happen. Again - what problem have you measured that inode preallocation will solve in your application? Don't make changes just because you *think* it will fix what you *think* is a problem. Measure, analyse, solve, in that order. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: sleeps and waits during io_submit 2015-12-01 23:41 ` Dave Chinner @ 2015-12-02 8:23 ` Avi Kivity 0 siblings, 0 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-02 8:23 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, Glauber Costa, xfs On 12/02/2015 01:41 AM, Dave Chinner wrote: > On Tue, Dec 01, 2015 at 10:56:01PM +0200, Avi Kivity wrote: >> On 12/01/2015 10:45 PM, Dave Chinner wrote: >>> On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote: >>> The difference is an allocation can block waiting on IO, and the >>> CPU can then go off and run another process, which then tries to do >>> an allocation. So you might only have 4 CPUs, but a workload that >>> can have a hundred active allocations at once (not uncommon in >>> file server workloads). >> But for us, probably not much more. We try to restrict active I/Os >> to the effective disk queue depth (more than that and they just turn >> sour waiting in the disk queue). >> >> >>> On workloads that are roughly 1 process per CPU, it's typical that >>> agcount = 2 * N cpus gives pretty good results on large filesystems. >> This is probably using sync calls. Using async calls you can have >> many more I/Os in progress (but still limited by effective disk >> queue depth). > Ah, no. Even with async IO you don't want unbound allocation > concurrency. Unbound, certainly not. But if my disk wants 100 concurrent operations to deliver maximum bandwidth, and XFS wants fewer concurrent allocations to satisfy some internal constraint, then I can't satisfy both. To be fair, the number 100 was measured for 4k reads. It's sure to be much lower for 128k writes, and since we set an extent size hint of 1MB, only 1/8th of those will be allocating. So I expect things to work in practice, at least with the current generation of disks. Unfortunately disk bandwidth is growing faster than latency is improving, which means that the effective concurrency is increasing.
> The allocation algorithms rely on having contiguous > free space extents that are much larger than the allocations being > done to work effectively and minimise file fragmentation. If you > chop the filesystem up into lots of small AGs, then it accelerates > the rate at which the free space gets chopped up into smaller > extents and performance then suffers. It's the same problem as > running a large filesystem near ENOSPC for an extended period of > time, which again is something we most definitely don't recommend > you do in production systems. I understand. I guess it makes AG randomization even more important, for our use case. What happens when an AG fills up? Can a file overflow to another AG? > >>> If you've got 400GB filesystems or you are using spinning disks, >>> then you probably don't want to go above 16 AGs, because then you >>> have problems with maintaining contiguous free space and you'll >>> seek the spinning disks to death.... >> We're concentrating on SSDs for now. > Sure, so "problems with maintaining contiguous free space" is what > you need to be concerned about. Right. Luckily our allocation patterns are very friendly towards that. We have append-only files that grow rapidly, then are immutable for a time, then are deleted. (It is a log-structured database, so a natural fit for SSDs.) We can increase our extent size hint if it will help the SSD any. > >>>>>> 'mount -o ikeep,' >>>>> Interesting. Our files are large so we could try this. >>> Keep in mind that ikeep means that inode allocation permanently >>> fragments free space, which can affect how large files are allocated >>> once you truncate/rm the original files. >> We can try to prime this by allocating a lot of inodes up front, >> then removing them, so that this doesn't happen. > Again - what problem have you measured that inode preallocation will > solve in your application? Don't make changes just because you > *think* it will fix what you *think* is a problem.
> Measure, analyse, solve, in that order. We are now investigating what we can do to fix the problem; we aren't committing to any solution yet. Certainly we plan to be certain of what the problem is before we fix it. Up until a few days ago we never saw any blocks with XFS, and were very happy -- but that was with 90us, 450k IOPS disks. With the slower disks, accessed through a certain hypervisor, we do see XFS block, and it is very worrying.
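The per-file extent size hint discussed a couple of messages up is set through the FS_IOC_FSSETXATTR ioctl with struct fsxattr from <linux/fs.h> (the XFS_IOC_FSSETXATTR the thread mentions is the same interface). A hedged sketch — the ioctl numbers are the Linux x86-64 values, and on a filesystem without extent size hint support the set simply fails and is reported rather than raised:

```python
# Best-effort sketch: read the current fsxattr, set FS_XFLAG_EXTSIZE and
# fsx_extsize, and write it back. Not production code.
import fcntl
import struct
import tempfile

FS_IOC_FSGETXATTR = 0x801C581F  # _IOR('X', 31, struct fsxattr)
FS_IOC_FSSETXATTR = 0x401C5820  # _IOW('X', 32, struct fsxattr)
FS_XFLAG_EXTSIZE = 0x00000800
FSXATTR = "=IIIII8s"  # fsx_xflags, fsx_extsize, fsx_nextents, fsx_projid, fsx_cowextsize, fsx_pad

def set_extent_size_hint(fd: int, extsize_bytes: int) -> bool:
    """Returns True if the hint was applied, False otherwise."""
    try:
        raw = fcntl.ioctl(fd, FS_IOC_FSGETXATTR, bytes(struct.calcsize(FSXATTR)))
        xflags, _, nextents, projid, cowext, pad = struct.unpack(FSXATTR, raw)
        new = struct.pack(FSXATTR, xflags | FS_XFLAG_EXTSIZE, extsize_bytes,
                          nextents, projid, cowext, pad)
        fcntl.ioctl(fd, FS_IOC_FSSETXATTR, new)
        return True
    except OSError as err:
        print(f"extent size hint not applied: {err}")
        return False

with tempfile.NamedTemporaryFile() as f:
    applied = set_extent_size_hint(f.fileno(), 1 << 20)  # 1MB, as in the thread
    print("applied:", applied)
```

On XFS the same effect is available from the shell via `xfs_io -c 'extsize 1m' <file>`.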
* Re: sleeps and waits during io_submit 2015-12-01 13:58 ` Avi Kivity 2015-12-01 14:01 ` Glauber Costa @ 2015-12-01 14:56 ` Brian Foster 2015-12-01 15:22 ` Avi Kivity 1 sibling, 1 reply; 58+ messages in thread From: Brian Foster @ 2015-12-01 14:56 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: > > > On 12/01/2015 03:11 PM, Brian Foster wrote: > >On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: > >>On 11/30/2015 06:14 PM, Brian Foster wrote: > >>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: > >>>>On 11/30/2015 04:10 PM, Brian Foster wrote: > >... > >>>The agsize/agcount mkfs-time heuristics change depending on the type of > >>>storage. A single AG can be up to 1TB and if the fs is not considered > >>>"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the > >>>default up to 4TB. If a stripe unit is set, the agsize/agcount is > >>>adjusted depending on the size of the overall volume (see > >>>xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). > >>We'll experiment with this. Surely it depends on more than the amount of > >>storage? If you have a high op rate you'll be more likely to excite > >>contention, no? > >> > >Sure. The absolute optimal configuration for your workload probably > >depends on more than storage size, but mkfs doesn't have that > >information. In general, it tries to use the most reasonable > >configuration based on the storage and expected workload. If you want to > >tweak it beyond that, indeed, the best bet is to experiment with what > >works. > > We will do that. > > >>>>Are those locks held around I/O, or just CPU operations, or a mix? > >>>I believe it's a mix of modifications and I/O, though it looks like some > >>>of the I/O cases don't necessarily wait on the lock. E.g., the AIL > >>>pushing case will trylock and defer to the next list iteration if the > >>>buffer is busy. > >>> > >>Ok. 
For us sleeping in io_submit() is death because we have no other thread > >>on that core to take its place. > >> > >The above is with regard to metadata I/O, whereas io_submit() is > >obviously for user I/O. > > Won't io_submit() also trigger metadata I/O? Or is that all deferred to > async tasks? I don't mind them blocking each other as long as they let my > io_submit alone. > Yeah, it can trigger metadata reads, force the log (the stale buffer example) or push the AIL (wait on log space). Metadata changes made directly via your I/O request are logged/committed via transactions, which are generally processed asynchronously from that point on. > > io_submit() can probably block in a variety of > >places afaict... it might have to read in the inode extent map, allocate > >blocks, take inode/ag locks, reserve log space for transactions, etc. > > Any chance of changing all that to be asynchronous? Doesn't sound too hard, > if somebody else has to do it. > I'm not following... if the fs needs to read in the inode extent map to prepare for an allocation, what else can the thread do but wait? Are you suggesting the request kick off whatever the blocking action happens to be asynchronously and return with an error such that the request can be retried later? > > > >It sounds to me that first and foremost you want to make sure you don't > >have however many parallel operations you typically have running > >contending on the same inodes or AGs. Hint: creating files under > >separate subdirectories is a quick and easy way to allocate inodes under > >separate AGs (the agno is encoded into the upper bits of the inode > >number). > > Unfortunately our directory layout cannot be changed. And doesn't this > require having agcount == O(number of active files)? That is easily in the > thousands. > I think Glauber's O(nr_cpus) comment is probably the more likely ballpark, but really it's something you'll probably just need to test to see how far you need to go to avoid AG contention. 
I'm primarily throwing the subdir thing out there for testing purposes. It's just an easy way to create inodes in a bunch of separate AGs so you can determine whether/how much it really helps with modified AG counts. I don't know enough about your application design to really comment on that... > > Reducing the frequency of block allocation/frees might also be >another help (e.g., preallocate and reuse files, > > Isn't that discouraged for SSDs? > Perhaps, if you're referring to the fact that the blocks are never freed and thus never discarded..? Are you running fstrim? If so, it would certainly impact that by holding blocks as allocated to inodes as opposed to putting them in free space trees where they can be discarded. If not, I don't see how it would make a difference, but perhaps I misunderstand the point. That said, there's probably others on the list who can more definitively discuss SSD characteristics than I... > We can do that for a subset of our files. > > We do use XFS_IOC_FSSETXATTR though. > >'mount -o ikeep,' > > Interesting. Our files are large so we could try this. > Just to be clear... this behavior change is more directly associated with file count than file size (though indirectly larger files might mean you have fewer of them, if that's your point). To generalize a bit, I'd be more wary of using this option if your filesystem can be used in an unstructured manner in any way. For example, if the file count can balloon up and back down temporarily, that's going to allocate a bunch of metadata space for inodes that won't ever be reclaimed or reused for anything other than inodes. > >etc.). Beyond that, you probably want to make sure the log is large enough to support all concurrent operations. See the xfs_log_grant_* tracepoints for a window into if/how long transaction reservations might be waiting on the log. > > I see that on a 400G fs, the log is 180MB.
Seems plenty large for write operations that are mostly large sequential, though I've no real feel for the numbers. Will keep an eye on this. > FWIW, XFS on recent kernels has grown some sysfs entries that might help give an idea of log reservation state at runtime. See the entries under /sys/fs/xfs/<dev>/log for details. Brian > Thanks for all the info. > >Brian
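Brian's sysfs pointer can be explored with a few lines. A hedged sketch — the entry names under /sys/fs/xfs/<dev>/log vary by kernel version, and on a machine with no XFS filesystems mounted this simply returns an empty dict:

```python
# Dump the runtime log-state counters under /sys/fs/xfs/<dev>/log.
import glob
import os

def xfs_log_state() -> dict:
    state = {}
    for path in glob.glob("/sys/fs/xfs/*/log/*"):
        dev = path.split("/")[4]  # /sys/fs/xfs/<dev>/log/<entry>
        try:
            with open(path) as f:
                state[(dev, os.path.basename(path))] = f.read().strip()
        except OSError:
            pass  # some entries may be unreadable or transient
    return state

for (dev, entry), value in sorted(xfs_log_state().items()):
    print(f"{dev}/{entry}: {value}")
```

For historical analysis the xfs_log_grant_* tracepoints mentioned above remain the richer source; the sysfs entries are just a cheap point-in-time view.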
* Re: sleeps and waits during io_submit 2015-12-01 14:56 ` Brian Foster @ 2015-12-01 15:22 ` Avi Kivity 2015-12-01 16:01 ` Brian Foster 2015-12-01 21:04 ` Dave Chinner 0 siblings, 2 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-01 15:22 UTC (permalink / raw) To: Brian Foster; +Cc: Glauber Costa, xfs On 12/01/2015 04:56 PM, Brian Foster wrote: > On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: >> >> On 12/01/2015 03:11 PM, Brian Foster wrote: >>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: >>>> On 11/30/2015 06:14 PM, Brian Foster wrote: >>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: >>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote: >>> ... >>>>> The agsize/agcount mkfs-time heuristics change depending on the type of >>>>> storage. A single AG can be up to 1TB and if the fs is not considered >>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the >>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is >>>>> adjusted depending on the size of the overall volume (see >>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). >>>> We'll experiment with this. Surely it depends on more than the amount of >>>> storage? If you have a high op rate you'll be more likely to excite >>>> contention, no? >>>> >>> Sure. The absolute optimal configuration for your workload probably >>> depends on more than storage size, but mkfs doesn't have that >>> information. In general, it tries to use the most reasonable >>> configuration based on the storage and expected workload. If you want to >>> tweak it beyond that, indeed, the best bet is to experiment with what >>> works. >> We will do that. >> >>>>>> Are those locks held around I/O, or just CPU operations, or a mix? >>>>> I believe it's a mix of modifications and I/O, though it looks like some >>>>> of the I/O cases don't necessarily wait on the lock. 
E.g., the AIL >>>>> pushing case will trylock and defer to the next list iteration if the >>>>> buffer is busy. >>>>> >>>> Ok. For us sleeping in io_submit() is death because we have no other thread >>>> on that core to take its place. >>>> >>> The above is with regard to metadata I/O, whereas io_submit() is >>> obviously for user I/O. >> Won't io_submit() also trigger metadata I/O? Or is that all deferred to >> async tasks? I don't mind them blocking each other as long as they let my >> io_submit alone. >> > Yeah, it can trigger metadata reads, force the log (the stale buffer > example) or push the AIL (wait on log space). Metadata changes made > directly via your I/O request are logged/committed via transactions, > which are generally processed asynchronously from that point on. > >>> io_submit() can probably block in a variety of >>> places afaict... it might have to read in the inode extent map, allocate >>> blocks, take inode/ag locks, reserve log space for transactions, etc. >> Any chance of changing all that to be asynchronous? Doesn't sound too hard, >> if somebody else has to do it. >> > I'm not following... if the fs needs to read in the inode extent map to > prepare for an allocation, what else can the thread do but wait? Are you > suggesting the request kick off whatever the blocking action happens to > be asynchronously and return with an error such that the request can be > retried later? Not quite, it should be invisible to the caller. That is, the code called by io_submit() (file_operations::write_iter, it seems to be called today) can kick off this operation and have it continue from where it left off. Seastar (the async user framework which we use to drive xfs) makes writing code like this easy, using continuations; but of course from ordinary threaded code it can be quite hard. btw, there was an attempt to make ext[34] async using this method, but I think it was ripped out. Yes, the mortal remains can still be seen with 'git grep EIOCBQUEUED'. 
> >>>It sounds to me that first and foremost you want to make sure you don't >>>have however many parallel operations you typically have running >>>contending on the same inodes or AGs. Hint: creating files under >>>separate subdirectories is a quick and easy way to allocate inodes under >>>separate AGs (the agno is encoded into the upper bits of the inode >>>number). >> Unfortunately our directory layout cannot be changed. And doesn't this >> require having agcount == O(number of active files)? That is easily in the >> thousands. >> > I think Glauber's O(nr_cpus) comment is probably the more likely > ballpark, but really it's something you'll probably just need to test to > see how far you need to go to avoid AG contention. > > I'm primarily throwing the subdir thing out there for testing purposes. > It's just an easy way to create inodes in a bunch of separate AGs so you > can determine whether/how much it really helps with modified AG counts. > I don't know enough about your application design to really comment on > that... We have O(cpus) shards that operate independently. Each shard writes 32MB commitlog files (that are pre-truncated to 32MB to allow concurrent writes without blocking); the files are then flushed and closed, and later removed. In parallel there are sequential writes and reads of large files (using 128kB buffers), as well as random reads. Files are immutable (append-only), and if a file is being written, it is not concurrently read. In general files are not shared across shards. All I/O is async and O_DIRECT. open(), truncate(), fdatasync(), and friends are called from a helper thread. As far as I can tell it should be a very friendly load for XFS and SSDs.
mount -o discard. And yes, overwrites are supposedly more expensive than trim old data + allocate new data, but maybe if you compare it with the work XFS has to do, perhaps the tradeoff is bad. > > If so, it would certainly impact that by holding blocks as allocated to > inodes as opposed to putting them in free space trees where they can be > discarded. If not, I don't see how it would make a difference, but > perhaps I misunderstand the point. That said, there's probably others on > the list who can more definitively discuss SSD characteristics than I... > >> We can do that for a subset of our files. >> >> We do use XFS_IOC_FSSETXATTR though. >> >>> 'mount -o ikeep,' >> Interesting. Our files are large so we could try this. >> > Just to be clear... this behavior change is more directly associated > with file count than file size (though indirectly larger files might > mean you have less of them, if that's your point). Yes, that's what I meant, and especially that if a lot of files are removed we'd be losing the inode space allocated to them. > > To generalize a bit, I'd be more weary of using this option if your > filesystem can be used in an unstructured manner in any way. For > example, if the file count can balloon up and back down temporarily, > that's going to allocate a bunch of metadata space for inodes that won't > ever be reclaimed or reused for anything other than inodes. Exactly. File count can balloon, but files will be large, so even the worst case waste is very limited. > >>> etc.). Beyond that, you probably want to make sure the log is large >>> enough to support all concurrent operations. See the xfs_log_grant_* >>> tracepoints for a window into if/how long transaction reservations might >>> be waiting on the log. >> I see that on an 400G fs, the log is 180MB. Seems plenty large for write >> operations that are mostly large sequential, though I've no real feel for >> the numbers. Will keep an eye on this. 
>> > FWIW, XFS on recent kernels has grown some sysfs entries that might help > give an idea of log reservation state at runtime. See the entries under > /sys/fs/xfs/<dev>/log for details. Great. We will study those with great interest. > > Brian > >> Thanks for all the info.
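The split Avi described earlier — async O_DIRECT I/O on the reactor thread, blocking calls like open()/truncate()/fdatasync() pushed to a helper thread — can be sketched with a thread-pool executor. This is illustrative Python, not Seastar or Scylla code; the names and the 32MB pre-truncation follow the description in this thread:

```python
# Blocking filesystem syscalls run on a helper thread so the reactor
# (event-loop) thread never sleeps in the kernel.
import asyncio
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

COMMITLOG_SIZE = 32 * 1024 * 1024  # pre-truncated to 32MB, as in the thread
helper = ThreadPoolExecutor(max_workers=1, thread_name_prefix="fs-helper")

def open_commitlog(path: str) -> int:
    # Runs on the helper thread: open + pre-truncate so later appends
    # need no blocking file-size updates on the reactor.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    os.ftruncate(fd, COMMITLOG_SIZE)
    return fd

async def reactor() -> int:
    loop = asyncio.get_running_loop()
    path = os.path.join(tempfile.mkdtemp(), "commitlog-0.log")
    fd = await loop.run_in_executor(helper, open_commitlog, path)
    size = os.fstat(fd).st_size
    await loop.run_in_executor(helper, os.fsync, fd)  # flush off-reactor
    os.close(fd)
    return size

print(asyncio.run(reactor()))  # 33554432
```

The unsolved part, which this thread is about, is that io_submit() itself can still block inside the kernel (extent map reads, allocation, log space) and cannot be pushed to a helper without serializing the data path.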
* Re: sleeps and waits during io_submit 2015-12-01 15:22 ` Avi Kivity @ 2015-12-01 16:01 ` Brian Foster 2015-12-01 16:08 ` Avi Kivity 2015-12-01 21:04 ` Dave Chinner 1 sibling, 1 reply; 58+ messages in thread From: Brian Foster @ 2015-12-01 16:01 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: > > > On 12/01/2015 04:56 PM, Brian Foster wrote: > >On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: > >> > >>On 12/01/2015 03:11 PM, Brian Foster wrote: > >>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: > >>>>On 11/30/2015 06:14 PM, Brian Foster wrote: > >>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: > >>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote: > >>>... > >>>>>The agsize/agcount mkfs-time heuristics change depending on the type of > >>>>>storage. A single AG can be up to 1TB and if the fs is not considered > >>>>>"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the > >>>>>default up to 4TB. If a stripe unit is set, the agsize/agcount is > >>>>>adjusted depending on the size of the overall volume (see > >>>>>xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). > >>>>We'll experiment with this. Surely it depends on more than the amount of > >>>>storage? If you have a high op rate you'll be more likely to excite > >>>>contention, no? > >>>> > >>>Sure. The absolute optimal configuration for your workload probably > >>>depends on more than storage size, but mkfs doesn't have that > >>>information. In general, it tries to use the most reasonable > >>>configuration based on the storage and expected workload. If you want to > >>>tweak it beyond that, indeed, the best bet is to experiment with what > >>>works. > >>We will do that. > >> > >>>>>>Are those locks held around I/O, or just CPU operations, or a mix? 
> >>>>>I believe it's a mix of modifications and I/O, though it looks like some > >>>>>of the I/O cases don't necessarily wait on the lock. E.g., the AIL > >>>>>pushing case will trylock and defer to the next list iteration if the > >>>>>buffer is busy. > >>>>> > >>>>Ok. For us sleeping in io_submit() is death because we have no other thread > >>>>on that core to take its place. > >>>> > >>>The above is with regard to metadata I/O, whereas io_submit() is > >>>obviously for user I/O. > >>Won't io_submit() also trigger metadata I/O? Or is that all deferred to > >>async tasks? I don't mind them blocking each other as long as they let my > >>io_submit alone. > >> > >Yeah, it can trigger metadata reads, force the log (the stale buffer > >example) or push the AIL (wait on log space). Metadata changes made > >directly via your I/O request are logged/committed via transactions, > >which are generally processed asynchronously from that point on. > > > >>> io_submit() can probably block in a variety of > >>>places afaict... it might have to read in the inode extent map, allocate > >>>blocks, take inode/ag locks, reserve log space for transactions, etc. > >>Any chance of changing all that to be asynchronous? Doesn't sound too hard, > >>if somebody else has to do it. > >> > >I'm not following... if the fs needs to read in the inode extent map to > >prepare for an allocation, what else can the thread do but wait? Are you > >suggesting the request kick off whatever the blocking action happens to > >be asynchronously and return with an error such that the request can be > >retried later? > > Not quite, it should be invisible to the caller. > > That is, the code called by io_submit() (file_operations::write_iter, it > seems to be called today) can kick off this operation and have it continue > from where it left off. > Isn't that generally what happens today? 
We submit an I/O which is asynchronous in nature and wait on a completion, which causes the cpu to schedule and execute another task until the completion is set by I/O completion (via an async callback). At that point, the issuing thread continues where it left off. I suspect I'm missing something... can you elaborate on what you'd do differently here (and how it helps)? > Seastar (the async user framework which we use to drive xfs) makes writing > code like this easy, using continuations; but of course from ordinary > threaded code it can be quite hard. > > btw, there was an attempt to make ext[34] async using this method, but I > think it was ripped out. Yes, the mortal remains can still be seen with > 'git grep EIOCBQUEUED'. > > > > >>>It sounds to me that first and foremost you want to make sure you don't > >>>have however many parallel operations you typically have running > >>>contending on the same inodes or AGs. Hint: creating files under > >>>separate subdirectories is a quick and easy way to allocate inodes under > >>>separate AGs (the agno is encoded into the upper bits of the inode > >>>number). > >>Unfortunately our directory layout cannot be changed. And doesn't this > >>require having agcount == O(number of active files)? That is easily in the > >>thousands. > >> > >I think Glauber's O(nr_cpus) comment is probably the more likely > >ballpark, but really it's something you'll probably just need to test to > >see how far you need to go to avoid AG contention. > > > >I'm primarily throwing the subdir thing out there for testing purposes. > >It's just an easy way to create inodes in a bunch of separate AGs so you > >can determine whether/how much it really helps with modified AG counts. > >I don't know enough about your application design to really comment on > >that... > > We have O(cpus) shards that operate independently. 
Each shard writes 32MB > commitlog files (that are pre-truncated to 32MB to allow concurrent writes > without blocking); the files are then flushed and closed, and later removed. > In parallel there are sequential writes and reads of large files using 128kB > buffers), as well as random reads. Files are immutable (append-only), and > if a file is being written, it is not concurrently read. In general files > are not shared across shards. All I/O is async and O_DIRECT. open(), > truncate(), fdatasync(), and friends are called from a helper thread. > > As far as I can tell it should a very friendly load for XFS and SSDs. > > > > >>> Reducing the frequency of block allocation/frees might also be > >>>another help (e.g., preallocate and reuse files, > >>Isn't that discouraged for SSDs? > >> > >Perhaps, if you're referring to the fact that the blocks are never freed > >and thus never discarded..? Are you running fstrim? > > mount -o discard. And yes, overwrites are supposedly more expensive than > trim old data + allocate new data, but maybe if you compare it with the work > XFS has to do, perhaps the tradeoff is bad. > Ok, my understanding is that '-o discard' is not recommended in favor of periodic fstrim for performance reasons, but that may or may not still be the case. Brian > > > > >If so, it would certainly impact that by holding blocks as allocated to > >inodes as opposed to putting them in free space trees where they can be > >discarded. If not, I don't see how it would make a difference, but > >perhaps I misunderstand the point. That said, there's probably others on > >the list who can more definitively discuss SSD characteristics than I... > > > > > > >>We can do that for a subset of our files. > >> > >>We do use XFS_IOC_FSSETXATTR though. > >> > >>>'mount -o ikeep,' > >>Interesting. Our files are large so we could try this. > >> > >Just to be clear... 
this behavior change is more directly associated > >with file count than file size (though indirectly larger files might > >mean you have less of them, if that's your point). > > Yes, that's what I meant, and especially that if a lot of files are removed > we'd be losing the inode space allocated to them. > > > > >To generalize a bit, I'd be more weary of using this option if your > >filesystem can be used in an unstructured manner in any way. For > >example, if the file count can balloon up and back down temporarily, > >that's going to allocate a bunch of metadata space for inodes that won't > >ever be reclaimed or reused for anything other than inodes. > > Exactly. File count can balloon, but files will be large, so even the worst > case waste is very limited. > > > > >>>etc.). Beyond that, you probably want to make sure the log is large > >>>enough to support all concurrent operations. See the xfs_log_grant_* > >>>tracepoints for a window into if/how long transaction reservations might > >>>be waiting on the log. > >>I see that on an 400G fs, the log is 180MB. Seems plenty large for write > >>operations that are mostly large sequential, though I've no real feel for > >>the numbers. Will keep an eye on this. > >> > >FWIW, XFS on recent kernels has grown some sysfs entries that might help > >give an idea of log reservation state at runtime. See the entries under > >/sys/fs/xfs/<dev>/log for details. > > Great. We will study those with great interest. > > > > >Brian > > > >>Thanks for all the info. > >> > >>>Brian > >>> > >>>>_______________________________________________ > >>>>xfs mailing list > >>>>xfs@oss.sgi.com > >>>>http://oss.sgi.com/mailman/listinfo/xfs > > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 16:01 ` Brian Foster @ 2015-12-01 16:08 ` Avi Kivity 2015-12-01 16:29 ` Brian Foster 0 siblings, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-01 16:08 UTC (permalink / raw) To: Brian Foster; +Cc: Glauber Costa, xfs On 12/01/2015 06:01 PM, Brian Foster wrote: > On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: >> >> On 12/01/2015 04:56 PM, Brian Foster wrote: >>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: >>>> On 12/01/2015 03:11 PM, Brian Foster wrote: >>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: >>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote: >>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: >>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote: >>>>> ... >>>>>>> The agsize/agcount mkfs-time heuristics change depending on the type of >>>>>>> storage. A single AG can be up to 1TB and if the fs is not considered >>>>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the >>>>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is >>>>>>> adjusted depending on the size of the overall volume (see >>>>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). >>>>>> We'll experiment with this. Surely it depends on more than the amount of >>>>>> storage? If you have a high op rate you'll be more likely to excite >>>>>> contention, no? >>>>>> >>>>> Sure. The absolute optimal configuration for your workload probably >>>>> depends on more than storage size, but mkfs doesn't have that >>>>> information. In general, it tries to use the most reasonable >>>>> configuration based on the storage and expected workload. If you want to >>>>> tweak it beyond that, indeed, the best bet is to experiment with what >>>>> works. >>>> We will do that. >>>> >>>>>>>> Are those locks held around I/O, or just CPU operations, or a mix? 
>>>>>>> I believe it's a mix of modifications and I/O, though it looks like some >>>>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL >>>>>>> pushing case will trylock and defer to the next list iteration if the >>>>>>> buffer is busy. >>>>>>> >>>>>> Ok. For us sleeping in io_submit() is death because we have no other thread >>>>>> on that core to take its place. >>>>>> >>>>> The above is with regard to metadata I/O, whereas io_submit() is >>>>> obviously for user I/O. >>>> Won't io_submit() also trigger metadata I/O? Or is that all deferred to >>>> async tasks? I don't mind them blocking each other as long as they let my >>>> io_submit alone. >>>> >>> Yeah, it can trigger metadata reads, force the log (the stale buffer >>> example) or push the AIL (wait on log space). Metadata changes made >>> directly via your I/O request are logged/committed via transactions, >>> which are generally processed asynchronously from that point on. >>> >>>>> io_submit() can probably block in a variety of >>>>> places afaict... it might have to read in the inode extent map, allocate >>>>> blocks, take inode/ag locks, reserve log space for transactions, etc. >>>> Any chance of changing all that to be asynchronous? Doesn't sound too hard, >>>> if somebody else has to do it. >>>> >>> I'm not following... if the fs needs to read in the inode extent map to >>> prepare for an allocation, what else can the thread do but wait? Are you >>> suggesting the request kick off whatever the blocking action happens to >>> be asynchronously and return with an error such that the request can be >>> retried later? >> Not quite, it should be invisible to the caller. >> >> That is, the code called by io_submit() (file_operations::write_iter, it >> seems to be called today) can kick off this operation and have it continue >> from where it left off. >> > Isn't that generally what happens today? You tell me. According to $subject, apparently not enough. 
Maybe we're triggering it more often, or we suffer more when it does trigger (the latter probably more likely). > We submit an I/O which is > asynchronous in nature and wait on a completion, which causes the cpu to > schedule and execute another task until the completion is set by I/O > completion (via an async callback). At that point, the issuing thread > continues where it left off. I suspect I'm missing something... can you > elaborate on what you'd do differently here (and how it helps)? Just apply the same technique everywhere: convert locks to trylock + schedule a continuation on failure. > >> Seastar (the async user framework which we use to drive xfs) makes writing >> code like this easy, using continuations; but of course from ordinary >> threaded code it can be quite hard. >> >> btw, there was an attempt to make ext[34] async using this method, but I >> think it was ripped out. Yes, the mortal remains can still be seen with >> 'git grep EIOCBQUEUED'. >> >>>>> It sounds to me that first and foremost you want to make sure you don't >>>>> have however many parallel operations you typically have running >>>>> contending on the same inodes or AGs. Hint: creating files under >>>>> separate subdirectories is a quick and easy way to allocate inodes under >>>>> separate AGs (the agno is encoded into the upper bits of the inode >>>>> number). >>>> Unfortunately our directory layout cannot be changed. And doesn't this >>>> require having agcount == O(number of active files)? That is easily in the >>>> thousands. >>>> >>> I think Glauber's O(nr_cpus) comment is probably the more likely >>> ballpark, but really it's something you'll probably just need to test to >>> see how far you need to go to avoid AG contention. >>> >>> I'm primarily throwing the subdir thing out there for testing purposes. >>> It's just an easy way to create inodes in a bunch of separate AGs so you >>> can determine whether/how much it really helps with modified AG counts. 
>>> I don't know enough about your application design to really comment on >>> that... >> We have O(cpus) shards that operate independently. Each shard writes 32MB >> commitlog files (that are pre-truncated to 32MB to allow concurrent writes >> without blocking); the files are then flushed and closed, and later removed. >> In parallel there are sequential writes and reads of large files using 128kB >> buffers), as well as random reads. Files are immutable (append-only), and >> if a file is being written, it is not concurrently read. In general files >> are not shared across shards. All I/O is async and O_DIRECT. open(), >> truncate(), fdatasync(), and friends are called from a helper thread. >> >> As far as I can tell it should a very friendly load for XFS and SSDs. >> >>>>> Reducing the frequency of block allocation/frees might also be >>>>> another help (e.g., preallocate and reuse files, >>>> Isn't that discouraged for SSDs? >>>> >>> Perhaps, if you're referring to the fact that the blocks are never freed >>> and thus never discarded..? Are you running fstrim? >> mount -o discard. And yes, overwrites are supposedly more expensive than >> trim old data + allocate new data, but maybe if you compare it with the work >> XFS has to do, perhaps the tradeoff is bad. >> > Ok, my understanding is that '-o discard' is not recommended in favor of > periodic fstrim for performance reasons, but that may or may not still > be the case. I understand that most SSDs have queued trim these days, but maybe I'm optimistic. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 16:08 ` Avi Kivity @ 2015-12-01 16:29 ` Brian Foster 2015-12-01 17:09 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Brian Foster @ 2015-12-01 16:29 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote: > > > On 12/01/2015 06:01 PM, Brian Foster wrote: > >On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: > >> > >>On 12/01/2015 04:56 PM, Brian Foster wrote: > >>>On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: > >>>>On 12/01/2015 03:11 PM, Brian Foster wrote: > >>>>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: > >>>>>>On 11/30/2015 06:14 PM, Brian Foster wrote: > >>>>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: > >>>>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote: > >>>>>... > >>>>>>>The agsize/agcount mkfs-time heuristics change depending on the type of > >>>>>>>storage. A single AG can be up to 1TB and if the fs is not considered > >>>>>>>"multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the > >>>>>>>default up to 4TB. If a stripe unit is set, the agsize/agcount is > >>>>>>>adjusted depending on the size of the overall volume (see > >>>>>>>xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). > >>>>>>We'll experiment with this. Surely it depends on more than the amount of > >>>>>>storage? If you have a high op rate you'll be more likely to excite > >>>>>>contention, no? > >>>>>> > >>>>>Sure. The absolute optimal configuration for your workload probably > >>>>>depends on more than storage size, but mkfs doesn't have that > >>>>>information. In general, it tries to use the most reasonable > >>>>>configuration based on the storage and expected workload. If you want to > >>>>>tweak it beyond that, indeed, the best bet is to experiment with what > >>>>>works. > >>>>We will do that. 
> >>>> > >>>>>>>>Are those locks held around I/O, or just CPU operations, or a mix? > >>>>>>>I believe it's a mix of modifications and I/O, though it looks like some > >>>>>>>of the I/O cases don't necessarily wait on the lock. E.g., the AIL > >>>>>>>pushing case will trylock and defer to the next list iteration if the > >>>>>>>buffer is busy. > >>>>>>> > >>>>>>Ok. For us sleeping in io_submit() is death because we have no other thread > >>>>>>on that core to take its place. > >>>>>> > >>>>>The above is with regard to metadata I/O, whereas io_submit() is > >>>>>obviously for user I/O. > >>>>Won't io_submit() also trigger metadata I/O? Or is that all deferred to > >>>>async tasks? I don't mind them blocking each other as long as they let my > >>>>io_submit alone. > >>>> > >>>Yeah, it can trigger metadata reads, force the log (the stale buffer > >>>example) or push the AIL (wait on log space). Metadata changes made > >>>directly via your I/O request are logged/committed via transactions, > >>>which are generally processed asynchronously from that point on. > >>> > >>>>> io_submit() can probably block in a variety of > >>>>>places afaict... it might have to read in the inode extent map, allocate > >>>>>blocks, take inode/ag locks, reserve log space for transactions, etc. > >>>>Any chance of changing all that to be asynchronous? Doesn't sound too hard, > >>>>if somebody else has to do it. > >>>> > >>>I'm not following... if the fs needs to read in the inode extent map to > >>>prepare for an allocation, what else can the thread do but wait? Are you > >>>suggesting the request kick off whatever the blocking action happens to > >>>be asynchronously and return with an error such that the request can be > >>>retried later? > >>Not quite, it should be invisible to the caller. > >> > >>That is, the code called by io_submit() (file_operations::write_iter, it > >>seems to be called today) can kick off this operation and have it continue > >>from where it left off. 
> >> > >Isn't that generally what happens today? > > You tell me. According to $subject, apparently not enough. Maybe we're > triggering it more often, or we suffer more when it does trigger (the latter > probably more likely). > The original mail describes looking at the sched:sched_switch tracepoint which on a quick look, appears to fire whenever a cpu context switch occurs. This likely triggers any time we wait on an I/O or a contended lock (among other situations I'm sure), and it signifies that something else is going to execute in our place until this thread can make progress. > > We submit an I/O which is > >asynchronous in nature and wait on a completion, which causes the cpu to > >schedule and execute another task until the completion is set by I/O > >completion (via an async callback). At that point, the issuing thread > >continues where it left off. I suspect I'm missing something... can you > >elaborate on what you'd do differently here (and how it helps)? > > Just apply the same technique everywhere: convert locks to trylock + > schedule a continuation on failure. > I'm certainly not an expert on the kernel scheduling, locking and serialization mechanisms, but my understanding is that most things outside of spin locks are reschedule points. For example, the wait_for_completion() calls XFS uses to wait on I/O boil down to schedule_timeout() calls. Buffer locks are implemented as semaphores and down() can end up in the same place. Brian > > > >>Seastar (the async user framework which we use to drive xfs) makes writing > >>code like this easy, using continuations; but of course from ordinary > >>threaded code it can be quite hard. > >> > >>btw, there was an attempt to make ext[34] async using this method, but I > >>think it was ripped out. Yes, the mortal remains can still be seen with > >>'git grep EIOCBQUEUED'. 
> >> > >>>>>It sounds to me that first and foremost you want to make sure you don't > >>>>>have however many parallel operations you typically have running > >>>>>contending on the same inodes or AGs. Hint: creating files under > >>>>>separate subdirectories is a quick and easy way to allocate inodes under > >>>>>separate AGs (the agno is encoded into the upper bits of the inode > >>>>>number). > >>>>Unfortunately our directory layout cannot be changed. And doesn't this > >>>>require having agcount == O(number of active files)? That is easily in the > >>>>thousands. > >>>> > >>>I think Glauber's O(nr_cpus) comment is probably the more likely > >>>ballpark, but really it's something you'll probably just need to test to > >>>see how far you need to go to avoid AG contention. > >>> > >>>I'm primarily throwing the subdir thing out there for testing purposes. > >>>It's just an easy way to create inodes in a bunch of separate AGs so you > >>>can determine whether/how much it really helps with modified AG counts. > >>>I don't know enough about your application design to really comment on > >>>that... > >>We have O(cpus) shards that operate independently. Each shard writes 32MB > >>commitlog files (that are pre-truncated to 32MB to allow concurrent writes > >>without blocking); the files are then flushed and closed, and later removed. > >>In parallel there are sequential writes and reads of large files using 128kB > >>buffers), as well as random reads. Files are immutable (append-only), and > >>if a file is being written, it is not concurrently read. In general files > >>are not shared across shards. All I/O is async and O_DIRECT. open(), > >>truncate(), fdatasync(), and friends are called from a helper thread. > >> > >>As far as I can tell it should a very friendly load for XFS and SSDs. > >> > >>>>> Reducing the frequency of block allocation/frees might also be > >>>>>another help (e.g., preallocate and reuse files, > >>>>Isn't that discouraged for SSDs? 
> >>>> > >>>Perhaps, if you're referring to the fact that the blocks are never freed > >>>and thus never discarded..? Are you running fstrim? > >>mount -o discard. And yes, overwrites are supposedly more expensive than > >>trim old data + allocate new data, but maybe if you compare it with the work > >>XFS has to do, perhaps the tradeoff is bad. > >> > >Ok, my understanding is that '-o discard' is not recommended in favor of > >periodic fstrim for performance reasons, but that may or may not still > >be the case. > > I understand that most SSDs have queued trim these days, but maybe I'm > optimistic. > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 16:29 ` Brian Foster @ 2015-12-01 17:09 ` Avi Kivity 2015-12-01 18:03 ` Carlos Maiolino 2015-12-01 18:51 ` Brian Foster 0 siblings, 2 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-01 17:09 UTC (permalink / raw) To: Brian Foster; +Cc: Glauber Costa, xfs On 12/01/2015 06:29 PM, Brian Foster wrote: > On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote: >> >> On 12/01/2015 06:01 PM, Brian Foster wrote: >>> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: >>>> On 12/01/2015 04:56 PM, Brian Foster wrote: >>>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: >>>>>> On 12/01/2015 03:11 PM, Brian Foster wrote: >>>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: >>>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote: >>>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: >>>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote: >>>>>>> ... >>>>>>>>> The agsize/agcount mkfs-time heuristics change depending on the type of >>>>>>>>> storage. A single AG can be up to 1TB and if the fs is not considered >>>>>>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the >>>>>>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is >>>>>>>>> adjusted depending on the size of the overall volume (see >>>>>>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details). >>>>>>>> We'll experiment with this. Surely it depends on more than the amount of >>>>>>>> storage? If you have a high op rate you'll be more likely to excite >>>>>>>> contention, no? >>>>>>>> >>>>>>> Sure. The absolute optimal configuration for your workload probably >>>>>>> depends on more than storage size, but mkfs doesn't have that >>>>>>> information. In general, it tries to use the most reasonable >>>>>>> configuration based on the storage and expected workload. 
If you want to >>>>>>> tweak it beyond that, indeed, the best bet is to experiment with what >>>>>>> works. >>>>>> We will do that. >>>>>> >>>>>>>>>> Are those locks held around I/O, or just CPU operations, or a mix? >>>>>>>>> I believe it's a mix of modifications and I/O, though it looks like some >>>>>>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL >>>>>>>>> pushing case will trylock and defer to the next list iteration if the >>>>>>>>> buffer is busy. >>>>>>>>> >>>>>>>> Ok. For us sleeping in io_submit() is death because we have no other thread >>>>>>>> on that core to take its place. >>>>>>>> >>>>>>> The above is with regard to metadata I/O, whereas io_submit() is >>>>>>> obviously for user I/O. >>>>>> Won't io_submit() also trigger metadata I/O? Or is that all deferred to >>>>>> async tasks? I don't mind them blocking each other as long as they let my >>>>>> io_submit alone. >>>>>> >>>>> Yeah, it can trigger metadata reads, force the log (the stale buffer >>>>> example) or push the AIL (wait on log space). Metadata changes made >>>>> directly via your I/O request are logged/committed via transactions, >>>>> which are generally processed asynchronously from that point on. >>>>> >>>>>>> io_submit() can probably block in a variety of >>>>>>> places afaict... it might have to read in the inode extent map, allocate >>>>>>> blocks, take inode/ag locks, reserve log space for transactions, etc. >>>>>> Any chance of changing all that to be asynchronous? Doesn't sound too hard, >>>>>> if somebody else has to do it. >>>>>> >>>>> I'm not following... if the fs needs to read in the inode extent map to >>>>> prepare for an allocation, what else can the thread do but wait? Are you >>>>> suggesting the request kick off whatever the blocking action happens to >>>>> be asynchronously and return with an error such that the request can be >>>>> retried later? >>>> Not quite, it should be invisible to the caller. 
>>>> >>>> That is, the code called by io_submit() (file_operations::write_iter, it >>>> seems to be called today) can kick off this operation and have it continue >>> >from where it left off. >>> Isn't that generally what happens today? >> You tell me. According to $subject, apparently not enough. Maybe we're >> triggering it more often, or we suffer more when it does trigger (the latter >> probably more likely). >> > The original mail describes looking at the sched:sched_switch tracepoint > which on a quick look, appears to fire whenever a cpu context switch > occurs. This likely triggers any time we wait on an I/O or a contended > lock (among other situations I'm sure), and it signifies that something > else is going to execute in our place until this thread can make > progress. For us, nothing else can execute in our place, we usually have exactly one thread per logical core. So we are heavily dependent on io_submit not sleeping. The case of a contended lock is, to me, less worrying. It can be reduced by using more allocation groups, which is apparently the shared resource under contention. The case of waiting for I/O is much more worrying, because I/O latency are much higher. But it seems like most of the DIO path does not trigger locking around I/O (and we are careful to avoid the ones that do, like writing beyond eof). (sorry for repeating myself, I have the feeling we are talking past each other and want to be on the same page) > >>> We submit an I/O which is >>> asynchronous in nature and wait on a completion, which causes the cpu to >>> schedule and execute another task until the completion is set by I/O >>> completion (via an async callback). At that point, the issuing thread >>> continues where it left off. I suspect I'm missing something... can you >>> elaborate on what you'd do differently here (and how it helps)? >> Just apply the same technique everywhere: convert locks to trylock + >> schedule a continuation on failure. 
>> > I'm certainly not an expert on the kernel scheduling, locking and > serialization mechanisms, but my understanding is that most things > outside of spin locks are reschedule points. For example, the > wait_for_completion() calls XFS uses to wait on I/O boil down to > schedule_timeout() calls. Buffer locks are implemented as semaphores and > down() can end up in the same place. But, for the most part, XFS seems to be able to avoid sleeping. The call to __blockdev_direct_IO only launches the I/O, so any locking is only around cpu operations and, unless there is contention, won't cause us to sleep in io_submit(). Trying to follow the code, it looks like xfs_get_blocks_direct (and __blockdev_direct_IO's get_block parameter in general) is synchronous, so we're just lucky to have everything in cache. If it isn't, we block right there. I really hope I'm misreading this and some other magic is happening elsewhere instead of this. > Brian > >>>> Seastar (the async user framework which we use to drive xfs) makes writing >>>> code like this easy, using continuations; but of course from ordinary >>>> threaded code it can be quite hard. >>>> >>>> btw, there was an attempt to make ext[34] async using this method, but I >>>> think it was ripped out. Yes, the mortal remains can still be seen with >>>> 'git grep EIOCBQUEUED'. >>>> >>>>>>> It sounds to me that first and foremost you want to make sure you don't >>>>>>> have however many parallel operations you typically have running >>>>>>> contending on the same inodes or AGs. Hint: creating files under >>>>>>> separate subdirectories is a quick and easy way to allocate inodes under >>>>>>> separate AGs (the agno is encoded into the upper bits of the inode >>>>>>> number). >>>>>> Unfortunately our directory layout cannot be changed. And doesn't this >>>>>> require having agcount == O(number of active files)? That is easily in the >>>>>> thousands. 
>>>>>> >>>>> I think Glauber's O(nr_cpus) comment is probably the more likely >>>>> ballpark, but really it's something you'll probably just need to test to >>>>> see how far you need to go to avoid AG contention. >>>>> >>>>> I'm primarily throwing the subdir thing out there for testing purposes. >>>>> It's just an easy way to create inodes in a bunch of separate AGs so you >>>>> can determine whether/how much it really helps with modified AG counts. >>>>> I don't know enough about your application design to really comment on >>>>> that... >>>> We have O(cpus) shards that operate independently. Each shard writes 32MB >>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes >>>> without blocking); the files are then flushed and closed, and later removed. >>>> In parallel there are sequential writes and reads of large files using 128kB >>>> buffers), as well as random reads. Files are immutable (append-only), and >>>> if a file is being written, it is not concurrently read. In general files >>>> are not shared across shards. All I/O is async and O_DIRECT. open(), >>>> truncate(), fdatasync(), and friends are called from a helper thread. >>>> >>>> As far as I can tell it should a very friendly load for XFS and SSDs. >>>> >>>>>>> Reducing the frequency of block allocation/frees might also be >>>>>>> another help (e.g., preallocate and reuse files, >>>>>> Isn't that discouraged for SSDs? >>>>>> >>>>> Perhaps, if you're referring to the fact that the blocks are never freed >>>>> and thus never discarded..? Are you running fstrim? >>>> mount -o discard. And yes, overwrites are supposedly more expensive than >>>> trim old data + allocate new data, but maybe if you compare it with the work >>>> XFS has to do, perhaps the tradeoff is bad. >>>> >>> Ok, my understanding is that '-o discard' is not recommended in favor of >>> periodic fstrim for performance reasons, but that may or may not still >>> be the case. 
>> I understand that most SSDs have queued trim these days, but maybe I'm >> optimistic. >> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 17:09 ` Avi Kivity @ 2015-12-01 18:03 ` Carlos Maiolino 2015-12-01 19:07 ` Avi Kivity 2015-12-01 18:51 ` Brian Foster 1 sibling, 1 reply; 58+ messages in thread From: Carlos Maiolino @ 2015-12-01 18:03 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs Hi Avi, > >else is going to execute in our place until this thread can make > >progress. > > For us, nothing else can execute in our place, we usually have exactly one > thread per logical core. So we are heavily dependent on io_submit not > sleeping. > > The case of a contended lock is, to me, less worrying. It can be reduced by > using more allocation groups, which is apparently the shared resource under > contention. > I apologize if I misread your previous comments, but IIRC you said you can't change the directory structure your application is using, and IIRC your application does not spread files across several directories. XFS spreads files across the allocation groups based on the directory these files are created in, trying to keep files as close as possible to their metadata. Directories are spread across the AGs in a round-robin way: each new directory will be created in the next allocation group, and XFS will try to allocate files in the same AG as their parent directory. (Take a look at the 'rotorstep' sysctl option for xfs). So, unless you have the files distributed across enough directories, increasing the number of allocation groups may not change the lock contention you're facing in this case. I really don't remember if it has been mentioned already, but if not, it might be worth taking this point into consideration. anyway, just my 0.02 > The case of waiting for I/O is much more worrying, because I/O latency are > much higher. But it seems like most of the DIO path does not trigger > locking around I/O (and we are careful to avoid the ones that do, like > writing beyond eof). 
> > (sorry for repeating myself, I have the feeling we are talking past each > other and want to be on the same page) > > > > >>> We submit an I/O which is > >>>asynchronous in nature and wait on a completion, which causes the cpu to > >>>schedule and execute another task until the completion is set by I/O > >>>completion (via an async callback). At that point, the issuing thread > >>>continues where it left off. I suspect I'm missing something... can you > >>>elaborate on what you'd do differently here (and how it helps)? > >>Just apply the same technique everywhere: convert locks to trylock + > >>schedule a continuation on failure. > >> > >I'm certainly not an expert on the kernel scheduling, locking and > >serialization mechanisms, but my understanding is that most things > >outside of spin locks are reschedule points. For example, the > >wait_for_completion() calls XFS uses to wait on I/O boil down to > >schedule_timeout() calls. Buffer locks are implemented as semaphores and > >down() can end up in the same place. > > But, for the most part, XFS seems to be able to avoid sleeping. The call to > __blockdev_direct_IO only launches the I/O, so any locking is only around > cpu operations and, unless there is contention, won't cause us to sleep in > io_submit(). > > Trying to follow the code, it looks like xfs_get_blocks_direct (and > __blockdev_direct_IO's get_block parameter in general) is synchronous, so > we're just lucky to have everything in cache. If it isn't, we block right > there. I really hope I'm misreading this and some other magic is happening > elsewhere instead of this. > > >Brian > > > >>>>Seastar (the async user framework which we use to drive xfs) makes writing > >>>>code like this easy, using continuations; but of course from ordinary > >>>>threaded code it can be quite hard. > >>>> > >>>>btw, there was an attempt to make ext[34] async using this method, but I > >>>>think it was ripped out. 
Yes, the mortal remains can still be seen with > >>>>'git grep EIOCBQUEUED'. > >>>> > >>>>>>>It sounds to me that first and foremost you want to make sure you don't > >>>>>>>have however many parallel operations you typically have running > >>>>>>>contending on the same inodes or AGs. Hint: creating files under > >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under > >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode > >>>>>>>number). > >>>>>>Unfortunately our directory layout cannot be changed. And doesn't this > >>>>>>require having agcount == O(number of active files)? That is easily in the > >>>>>>thousands. > >>>>>> > >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely > >>>>>ballpark, but really it's something you'll probably just need to test to > >>>>>see how far you need to go to avoid AG contention. > >>>>> > >>>>>I'm primarily throwing the subdir thing out there for testing purposes. > >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you > >>>>>can determine whether/how much it really helps with modified AG counts. > >>>>>I don't know enough about your application design to really comment on > >>>>>that... > >>>>We have O(cpus) shards that operate independently. Each shard writes 32MB > >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes > >>>>without blocking); the files are then flushed and closed, and later removed. > >>>>In parallel there are sequential writes and reads of large files using 128kB > >>>>buffers), as well as random reads. Files are immutable (append-only), and > >>>>if a file is being written, it is not concurrently read. In general files > >>>>are not shared across shards. All I/O is async and O_DIRECT. open(), > >>>>truncate(), fdatasync(), and friends are called from a helper thread. > >>>> > >>>>As far as I can tell it should a very friendly load for XFS and SSDs. 
> >>>> > >>>>>>> Reducing the frequency of block allocation/frees might also be > >>>>>>>another help (e.g., preallocate and reuse files, > >>>>>>Isn't that discouraged for SSDs? > >>>>>> > >>>>>Perhaps, if you're referring to the fact that the blocks are never freed > >>>>>and thus never discarded..? Are you running fstrim? > >>>>mount -o discard. And yes, overwrites are supposedly more expensive than > >>>>trim old data + allocate new data, but maybe if you compare it with the work > >>>>XFS has to do, perhaps the tradeoff is bad. > >>>> > >>>Ok, my understanding is that '-o discard' is not recommended in favor of > >>>periodic fstrim for performance reasons, but that may or may not still > >>>be the case. > >>I understand that most SSDs have queued trim these days, but maybe I'm > >>optimistic. > >> > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs -- Carlos _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
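The "trylock + schedule a continuation on failure" technique discussed above can be sketched in user space as follows. This is a minimal illustration with invented names (`ContinuationRunner`, `submit`, `poll`); it is not Seastar's actual API, and a kernel-side equivalent would hang a callback off the lock waiter rather than polling:

```python
import threading
from collections import deque

class ContinuationRunner:
    """Run tasks without ever blocking: if a lock is contended, defer the
    continuation instead of sleeping on the lock (hypothetical sketch)."""

    def __init__(self):
        self.pending = deque()

    def submit(self, lock, continuation):
        # Try to take the lock without sleeping; on failure, queue the
        # continuation to be retried later instead of blocking the thread.
        if lock.acquire(blocking=False):
            try:
                continuation()
            finally:
                lock.release()
        else:
            self.pending.append((lock, continuation))

    def poll(self):
        # Retry deferred continuations; anything still contended is re-queued.
        for _ in range(len(self.pending)):
            lock, cont = self.pending.popleft()
            self.submit(lock, cont)
```

The point of the pattern is that the submitting thread always returns immediately; contended work is completed on a later `poll()` rather than by putting the submitter to sleep.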
* Re: sleeps and waits during io_submit 2015-12-01 18:03 ` Carlos Maiolino @ 2015-12-01 19:07 ` Avi Kivity 2015-12-01 21:19 ` Dave Chinner 0 siblings, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-01 19:07 UTC (permalink / raw) To: Glauber Costa, xfs On 12/01/2015 08:03 PM, Carlos Maiolino wrote: > Hi Avi, > >>> else is going to execute in our place until this thread can make >>> progress. >> For us, nothing else can execute in our place, we usually have exactly one >> thread per logical core. So we are heavily dependent on io_submit not >> sleeping. >> >> The case of a contended lock is, to me, less worrying. It can be reduced by >> using more allocation groups, which is apparently the shared resource under >> contention. >> > I apologize if I misread your previous comments, but, IIRC you said you can't > change the directory structure your application is using, and IIRC your > application does not spread files across several directories. I miswrote somewhat: the application writes data files and commitlog files. The data file directory structure is fixed due to compatibility concerns (it is not a single directory, but some workloads will see most access on files in a single directory. The commitlog directory structure is more relaxed, and we can split it to a directory per shard (=cpu) or something else. If worst comes to worst, we'll hack around this and distribute the data files into more directories, and provide some hack for compatibility. > XFS spread files across the allocation groups, based on the directory these > files are created, Idea: create the files in some subdirectory, and immediately move them to their required location. > trying to keep files as close as possible from their > metadata. This is pointless for an SSD. Perhaps XFS should randomize the ag on nonrotational media instead. 
> Directories are spread across the AGs in a 'round-robin' way; each > new directory will be created in the next allocation group, and xfs will try > to allocate the files in the same AG as its parent directory. (Take a look at > the 'rotorstep' sysctl option for xfs). > > So, unless you have the files distributed across enough directories, increasing > the number of allocation groups may not change the lock contention you're > facing in this case. > > I really don't remember if it has been mentioned already, but if not, it might > be worth taking this point into consideration. Thanks. I think you should really consider randomizing the ag for SSDs, and meanwhile, we can just use the creation-directory hack to get the same effect, at the cost of an extra system call. So at least for this problem, there is a solution. > anyway, just my 0.02 > >> The case of waiting for I/O is much more worrying, because I/O latencies are >> much higher. But it seems like most of the DIO path does not trigger >> locking around I/O (and we are careful to avoid the ones that do, like >> writing beyond eof). >> >> (sorry for repeating myself, I have the feeling we are talking past each >> other and want to be on the same page) >> >>>>> We submit an I/O which is >>>>> asynchronous in nature and wait on a completion, which causes the cpu to >>>>> schedule and execute another task until the completion is set by I/O >>>>> completion (via an async callback). At that point, the issuing thread >>>>> continues where it left off. I suspect I'm missing something... can you >>>>> elaborate on what you'd do differently here (and how it helps)? >>>> Just apply the same technique everywhere: convert locks to trylock + >>>> schedule a continuation on failure. >>>> >>> I'm certainly not an expert on the kernel scheduling, locking and >>> serialization mechanisms, but my understanding is that most things >>> outside of spin locks are reschedule points. 
For example, the >>> wait_for_completion() calls XFS uses to wait on I/O boil down to >>> schedule_timeout() calls. Buffer locks are implemented as semaphores and >>> down() can end up in the same place. >> But, for the most part, XFS seems to be able to avoid sleeping. The call to >> __blockdev_direct_IO only launches the I/O, so any locking is only around >> cpu operations and, unless there is contention, won't cause us to sleep in >> io_submit(). >> >> Trying to follow the code, it looks like xfs_get_blocks_direct (and >> __blockdev_direct_IO's get_block parameter in general) is synchronous, so >> we're just lucky to have everything in cache. If it isn't, we block right >> there. I really hope I'm misreading this and some other magic is happening >> elsewhere instead of this. >> >>> Brian >>> >>>>>> Seastar (the async user framework which we use to drive xfs) makes writing >>>>>> code like this easy, using continuations; but of course from ordinary >>>>>> threaded code it can be quite hard. >>>>>> >>>>>> btw, there was an attempt to make ext[34] async using this method, but I >>>>>> think it was ripped out. Yes, the mortal remains can still be seen with >>>>>> 'git grep EIOCBQUEUED'. >>>>>> >>>>>>>>> It sounds to me that first and foremost you want to make sure you don't >>>>>>>>> have however many parallel operations you typically have running >>>>>>>>> contending on the same inodes or AGs. Hint: creating files under >>>>>>>>> separate subdirectories is a quick and easy way to allocate inodes under >>>>>>>>> separate AGs (the agno is encoded into the upper bits of the inode >>>>>>>>> number). >>>>>>>> Unfortunately our directory layout cannot be changed. And doesn't this >>>>>>>> require having agcount == O(number of active files)? That is easily in the >>>>>>>> thousands. 
>>>>>>>> >>>>>>> I think Glauber's O(nr_cpus) comment is probably the more likely >>>>>>> ballpark, but really it's something you'll probably just need to test to >>>>>>> see how far you need to go to avoid AG contention. >>>>>>> >>>>>>> I'm primarily throwing the subdir thing out there for testing purposes. >>>>>>> It's just an easy way to create inodes in a bunch of separate AGs so you >>>>>>> can determine whether/how much it really helps with modified AG counts. >>>>>>> I don't know enough about your application design to really comment on >>>>>>> that... >>>>>> We have O(cpus) shards that operate independently. Each shard writes 32MB >>>>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes >>>>>> without blocking); the files are then flushed and closed, and later removed. >>>>>> In parallel there are sequential writes and reads of large files using 128kB >>>>>> buffers), as well as random reads. Files are immutable (append-only), and >>>>>> if a file is being written, it is not concurrently read. In general files >>>>>> are not shared across shards. All I/O is async and O_DIRECT. open(), >>>>>> truncate(), fdatasync(), and friends are called from a helper thread. >>>>>> >>>>>> As far as I can tell it should a very friendly load for XFS and SSDs. >>>>>> >>>>>>>>> Reducing the frequency of block allocation/frees might also be >>>>>>>>> another help (e.g., preallocate and reuse files, >>>>>>>> Isn't that discouraged for SSDs? >>>>>>>> >>>>>>> Perhaps, if you're referring to the fact that the blocks are never freed >>>>>>> and thus never discarded..? Are you running fstrim? >>>>>> mount -o discard. And yes, overwrites are supposedly more expensive than >>>>>> trim old data + allocate new data, but maybe if you compare it with the work >>>>>> XFS has to do, perhaps the tradeoff is bad. 
>>>>>> >>>>> Ok, my understanding is that '-o discard' is not recommended in favor of >>>>> periodic fstrim for performance reasons, but that may or may not still >>>>> be the case. >>>> I understand that most SSDs have queued trim these days, but maybe I'm >>>> optimistic. >>>> >> _______________________________________________ >> xfs mailing list >> xfs@oss.sgi.com >> http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
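Given the round-robin directory-to-AG placement described in this thread, one way to spread per-shard commitlog files across allocation groups is to give each shard its own subdirectory. A rough sketch (hypothetical helper, not code from the application under discussion):

```python
import os

def open_shard_commitlog(base_dir, shard, name):
    # Each shard gets its own subdirectory. Because XFS assigns new
    # directories to allocation groups round-robin, per-shard directories
    # tend to land in different AGs, and files are allocated in the same
    # AG as their parent directory. (Hypothetical helper for illustration.)
    shard_dir = os.path.join(base_dir, "shard-%d" % shard)
    os.makedirs(shard_dir, exist_ok=True)
    path = os.path.join(shard_dir, name)
    return os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
```

Whether this actually reduces AG lock contention depends on the filesystem's agcount relative to the number of shards, which is why the thread suggests testing with varied AG counts.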
* Re: sleeps and waits during io_submit 2015-12-01 19:07 ` Avi Kivity @ 2015-12-01 21:19 ` Dave Chinner 2015-12-01 21:38 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-12-01 21:19 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote: > On 12/01/2015 08:03 PM, Carlos Maiolino wrote: > >Hi Avi, > > > >>>else is going to execute in our place until this thread can make > >>>progress. > >>For us, nothing else can execute in our place, we usually have exactly one > >>thread per logical core. So we are heavily dependent on io_submit not > >>sleeping. > >> > >>The case of a contended lock is, to me, less worrying. It can be reduced by > >>using more allocation groups, which is apparently the shared resource under > >>contention. > >> > >I apologize if I misread your previous comments, but, IIRC you said you can't > >change the directory structure your application is using, and IIRC your > >application does not spread files across several directories. > > I miswrote somewhat: the application writes data files and commitlog > files. The data file directory structure is fixed due to > compatibility concerns (it is not a single directory, but some > workloads will see most access on files in a single directory. The > commitlog directory structure is more relaxed, and we can split it > to a directory per shard (=cpu) or something else. > > If worst comes to worst, we'll hack around this and distribute the > data files into more directories, and provide some hack for > compatibility. > > >XFS spread files across the allocation groups, based on the directory these > >files are created, > > Idea: create the files in some subdirectory, and immediately move > them to their required location. See xfs_fsr. > > > trying to keep files as close as possible from their > >metadata. > > This is pointless for an SSD. Perhaps XFS should randomize the ag on > nonrotational media instead. 
Actually, no, it is not pointless. SSDs do not require optimisation for minimal seek time, but data locality is still just as important as on spinning disks, if not more so. Why? Because the garbage collection routines in the SSDs are all about locality, and we can't drive garbage collection effectively via discard operations if the filesystem is not keeping temporally related files close together in its block address space. e.g. If the files in a directory are all close together, and the directory is removed, we then leave a big empty contiguous region in the filesystem free space map, and when we send discards over that we end up with a single big trim, and the drive handles that far more effectively than lots of little trims (i.e. one per file) that the drive cannot do anything useful with, because they are all smaller than the internal SSD page/block sizes and so get ignored. This is one of the reasons fstrim is so much more efficient and effective than using the discard mount option. And, well, XFS is designed to operate on storage devices made up of more than one drive, so the way AGs are selected is designed to give long term load balancing (both for space usage and instantaneous performance). With the existing algorithms we've not had any issues with SSD lifetimes, long term performance degradation, etc., so there's no evidence that we actually need to change the fundamental allocation algorithms specifically for SSDs. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 21:19 ` Dave Chinner @ 2015-12-01 21:38 ` Avi Kivity 2015-12-01 23:06 ` Dave Chinner 0 siblings, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-01 21:38 UTC (permalink / raw) To: Dave Chinner; +Cc: Glauber Costa, xfs On 12/01/2015 11:19 PM, Dave Chinner wrote: > On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote: >> On 12/01/2015 08:03 PM, Carlos Maiolino wrote: >>> Hi Avi, >>> >>>>> else is going to execute in our place until this thread can make >>>>> progress. >>>> For us, nothing else can execute in our place, we usually have exactly one >>>> thread per logical core. So we are heavily dependent on io_submit not >>>> sleeping. >>>> >>>> The case of a contended lock is, to me, less worrying. It can be reduced by >>>> using more allocation groups, which is apparently the shared resource under >>>> contention. >>>> >>> I apologize if I misread your previous comments, but, IIRC you said you can't >>> change the directory structure your application is using, and IIRC your >>> application does not spread files across several directories. >> I miswrote somewhat: the application writes data files and commitlog >> files. The data file directory structure is fixed due to >> compatibility concerns (it is not a single directory, but some >> workloads will see most access on files in a single directory. The >> commitlog directory structure is more relaxed, and we can split it >> to a directory per shard (=cpu) or something else. >> >> If worst comes to worst, we'll hack around this and distribute the >> data files into more directories, and provide some hack for >> compatibility. >> >>> XFS spread files across the allocation groups, based on the directory these >>> files are created, >> Idea: create the files in some subdirectory, and immediately move >> them to their required location. > See xfs_fsr. Can you elaborate? I don't see how it is applicable. 
My hack involves creating the file in a random directory, and while it is still zero sized, move it to its final directory. This is simply to defeat the ag selection heuristic. No data is copied. >>> trying to keep files as close as possible from their >>> metadata. >> This is pointless for an SSD. Perhaps XFS should randomize the ag on >> nonrotational media instead. > Actually, no, it is not pointless. SSDs do not require optimisation > for minimal seek time, but data locality is still just as important > as spinning disks, if not moreso. Why? Because the garbage > collection routines in the SSDs are all about locality and we can't > drive garbage collection effectively via discard operations if the > filesystem is not keeping temporally related files close together in > it's block address space. In my case, files in the same directory are not temporally related. But I understand where the heuristic comes from. Maybe an ioctl to set a directory attribute "the files in this directory are not temporally related"? I imagine this will be useful for many server applications. > e.g. If the files in a directory are all close together, and the > directory is removed, we then leave a big empty contiguous region in > the filesystem free space map, and when we send discards over that > we end up with a single big trim and the drive handles that far more Would this not be defeated if a directory that happens to share the allocation group gets populated simultaneously? > effectively than lots of little trims (i.e. one per file) that the > drive cannot do anything useful with because they are all smaller > than the internal SSD page/block sizes and so get ignored. This is > one of the reasons fstrim is so much more efficient and effective > than using the discard mount option. In my use case, the files are fairly large, and there is constant rewriting (not in-place: files are read, merged, and written back). So I'm worried an fstrim can happen too late. 
> > And, well, XFS is designed to operate on storage devices made up of > more than one drive, so the way AGs are selected is designed to > given long term load balancing (both for space usage and > instantenous performance). With the existing algorithms we've not > had any issues with SSD lifetimes, long term performance > degradation, etc, so there's no evidence that we actually need to > change the fundamental allocation algorithms specially for SSDs. > Ok. Maybe the SSDs can deal with untrimmed overwrites efficiently, provided the io sizes are large enough. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
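The zero-size rename hack described above could look roughly like this (a sketch under the assumption that all directories live on the same XFS filesystem, so rename() is a metadata-only operation and no data moves; the function name is invented):

```python
import os
import random

def create_defeating_ag_heuristic(spread_dirs, final_dir, name):
    # Create the file in a randomly chosen directory so the allocator
    # associates the inode with that directory's AG, then rename it into
    # its real location while it is still zero-sized. rename() within one
    # filesystem only rewrites directory entries; the already-open fd
    # remains valid and no data is copied.
    birth_dir = random.choice(spread_dirs)
    tmp_path = os.path.join(birth_dir, name)
    fd = os.open(tmp_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.rename(tmp_path, os.path.join(final_dir, name))
    return fd
```

This is the "extra system call" cost mentioned above: one rename per file created.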
* Re: sleeps and waits during io_submit 2015-12-01 21:38 ` Avi Kivity @ 2015-12-01 23:06 ` Dave Chinner 2015-12-02 9:02 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-12-01 23:06 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote: > On 12/01/2015 11:19 PM, Dave Chinner wrote: > >On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote: > >>On 12/01/2015 08:03 PM, Carlos Maiolino wrote: > >>>Hi Avi, > >>> > >>>>>else is going to execute in our place until this thread can make > >>>>>progress. > >>>>For us, nothing else can execute in our place, we usually have exactly one > >>>>thread per logical core. So we are heavily dependent on io_submit not > >>>>sleeping. > >>>> > >>>>The case of a contended lock is, to me, less worrying. It can be reduced by > >>>>using more allocation groups, which is apparently the shared resource under > >>>>contention. > >>>> > >>>I apologize if I misread your previous comments, but, IIRC you said you can't > >>>change the directory structure your application is using, and IIRC your > >>>application does not spread files across several directories. > >>I miswrote somewhat: the application writes data files and commitlog > >>files. The data file directory structure is fixed due to > >>compatibility concerns (it is not a single directory, but some > >>workloads will see most access on files in a single directory. The > >>commitlog directory structure is more relaxed, and we can split it > >>to a directory per shard (=cpu) or something else. > >> > >>If worst comes to worst, we'll hack around this and distribute the > >>data files into more directories, and provide some hack for > >>compatibility. > >> > >>>XFS spread files across the allocation groups, based on the directory these > >>>files are created, > >>Idea: create the files in some subdirectory, and immediately move > >>them to their required location. > >See xfs_fsr. 
> > Can you elaborate? I don't see how it is applicable. Just pointing out that this is what xfs_fsr does to control locality of allocation for files it is defragmenting. Except that rather than moving files, it uses XFS_IOC_SWAPEXT to switch the data between two inodes atomically... > My hack involves creating the file in a random directory, and while > it is still zero sized, move it to its final directory. This is > simply to defeat the ag selection heuristic. Which you really don't want to do. > >>> trying to keep files as close as possible from their > >>>metadata. > >>This is pointless for an SSD. Perhaps XFS should randomize the ag on > >>nonrotational media instead. > >Actually, no, it is not pointless. SSDs do not require optimisation > >for minimal seek time, but data locality is still just as important > >as spinning disks, if not moreso. Why? Because the garbage > >collection routines in the SSDs are all about locality and we can't > >drive garbage collection effectively via discard operations if the > >filesystem is not keeping temporally related files close together in > >it's block address space. > > In my case, files in the same directory are not temporally related. > But I understand where the heuristic comes from. > > Maybe an ioctl to set a directory attribute "the files in this > directory are not temporally related"? And exactly what does that gain us? Exactly what problem are you trying to solve by manipulating file locality that can't be solved by existing knobs and config options? Perhaps you'd like to read up on how the inode32 allocator behaves? > >e.g. 
If the files in a directory are all close together, and the > >directory is removed, we then leave a big empty contiguous region in > >the filesystem free space map, and when we send discards over that > >we end up with a single big trim and the drive handles that far more > > Would this not be defeated if a directory that happens to share the > allocation group gets populated simultaneously? Sure. But this sort of thing is rare in the real world, and when they do occur, it generally only takes small tweaks to algorithms and layouts to make them go away. I don't care to bikeshed about theoretical problems - I'm in the business of finding the root cause of the problems users are having and solving those problems. So far what you've given us is a ball of "there's blocking in AIO submission", and the only one that is clear cut is the timestamp update. Go back and categorise the types of blocking that you are seeing - whether it be on the AGIs during inode manipulation, on the AGFs because of concurrent extent allocation, on log forces because of slow discards in the transaction completion, on the transaction subsystem because of a lack of log space for concurrent reservations, etc. And then determine if changing the layout of the filesystem (e.g. number of AGs, size of log, etc) and different mount options (e.g. turning off discard, using inode32 allocator, etc) make any difference to the blocking issues you are seeing. 
> > In my use case, the files are fairly large, and there is constant > rewriting (not in-place: files are read, merged, and written back). > So I'm worried an fstrim can happen too late. Have you measured the SSD performance degradation over time due to large overwrites? If not, then again it is a good chance you are trying to solve a theoretical problem rather than a real problem.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 23:06 ` Dave Chinner @ 2015-12-02 9:02 ` Avi Kivity 2015-12-02 12:57 ` Carlos Maiolino 2015-12-02 23:19 ` Dave Chinner 0 siblings, 2 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-02 9:02 UTC (permalink / raw) To: Dave Chinner; +Cc: Glauber Costa, xfs On 12/02/2015 01:06 AM, Dave Chinner wrote: > On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote: >> On 12/01/2015 11:19 PM, Dave Chinner wrote: >>> On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote: >>>> On 12/01/2015 08:03 PM, Carlos Maiolino wrote: >>>>> Hi Avi, >>>>> >>>>>>> else is going to execute in our place until this thread can make >>>>>>> progress. >>>>>> For us, nothing else can execute in our place, we usually have exactly one >>>>>> thread per logical core. So we are heavily dependent on io_submit not >>>>>> sleeping. >>>>>> >>>>>> The case of a contended lock is, to me, less worrying. It can be reduced by >>>>>> using more allocation groups, which is apparently the shared resource under >>>>>> contention. >>>>>> >>>>> I apologize if I misread your previous comments, but, IIRC you said you can't >>>>> change the directory structure your application is using, and IIRC your >>>>> application does not spread files across several directories. >>>> I miswrote somewhat: the application writes data files and commitlog >>>> files. The data file directory structure is fixed due to >>>> compatibility concerns (it is not a single directory, but some >>>> workloads will see most access on files in a single directory. The >>>> commitlog directory structure is more relaxed, and we can split it >>>> to a directory per shard (=cpu) or something else. >>>> >>>> If worst comes to worst, we'll hack around this and distribute the >>>> data files into more directories, and provide some hack for >>>> compatibility. 
>>>> >>>>> XFS spread files across the allocation groups, based on the directory these >>>>> files are created, >>>> Idea: create the files in some subdirectory, and immediately move >>>> them to their required location. >>> See xfs_fsr. >> Can you elaborate? I don't see how it is applicable. > Just pointing out that this is what xfs_fsr does to control locality > of allocation for files it is defragmenting. Except that rather than > moving files, it uses XFS_IOC_SWAPEXT to switch the data between two > inodes atomically... Ok, thanks. > >> My hack involves creating the file in a random directory, and while >> it is still zero sized, move it to its final directory. This is >> simply to defeat the ag selection heuristic. > Which you really don't want to do. Why not? For my directory structure, files in the same directory do not share temporal locality. What does the ag selection heuristic give me? > >>>>> trying to keep files as close as possible from their >>>>> metadata. >>>> This is pointless for an SSD. Perhaps XFS should randomize the ag on >>>> nonrotational media instead. >>> Actually, no, it is not pointless. SSDs do not require optimisation >>> for minimal seek time, but data locality is still just as important >>> as spinning disks, if not moreso. Why? Because the garbage >>> collection routines in the SSDs are all about locality and we can't >>> drive garbage collection effectively via discard operations if the >>> filesystem is not keeping temporally related files close together in >>> it's block address space. >> In my case, files in the same directory are not temporally related. >> But I understand where the heuristic comes from. >> >> Maybe an ioctl to set a directory attribute "the files in this >> directory are not temporally related"? > And exactly what does that gain us? I have a directory with commitlog files that are constantly and rapidly being created, appended to, and removed, from all logical cores in the system. 
Does this not put pressure on that allocation group's locks? > Exactly what problem are you > trying to solve by manipulating file locality that can't be solved > by existing knobs and config options? I admit I don't know much about the existing knobs and config options. Pointers are appreciated. > > Perhaps you'd like to read up on how the inode32 allocator behaves? Indeed I would, pointers are appreciated. > >>> e.g. If the files in a directory are all close together, and the >>> directory is removed, we then leave a big empty contiguous region in >>> the filesystem free space map, and when we send discards over that >>> we end up with a single big trim and the drive handles that far more >> Would this not be defeated if a directory that happens to share the >> allocation group gets populated simultaneously? > Sure. But this sort of thing is rare in the real world, and when > they do occur, it generally only takes small tweaks to algorithms > and layouts make them go away. I don't care to bikeshed about > theoretical problems - I'm in the business of finding the root cause > of the problems users are having and solving those problems. So far > what you've given us is a ball of "there's blocking in AIO > submission", and the only one that is clear cut is the timestamp > update. > > Go back and categorise the types of blocking that you are seeing - > whether it be on the AGIs during inode manipulation, one the AGFs > becuse of concurrent extent allocation, on log forces because of > slow discards in the transcation completion, on the transaction > subsystem because of a lack of log space for concurrent > reservations, etc. And then determine if changing the layout of the > filesystem (e.g. number of AGs, size of log, etc) and different > mount options (e.g. turning off discard, using inode32 allocator, > etc) make any difference to the blocking issues you are seeing. 
> > Once we know which of the different algorithms is causing the > blocking issues, we'll know a lot more about why we're having > problems and a better idea of what problems we actually need to > solve. I'm happy to hack off the lowest hanging fruit and then go after the next one. I understand you're annoyed at having to defend against what may be non-problems; but for me it is an opportunity to learn about the file system. For us it is the weakest spot in our system, because on the one hand we heavily depend on async behavior and on the other hand Linux is notoriously bad at it. So we are very nervous when blocking happens. > >>>effectively than lots of little trims (i.e. one per file) that the >>> drive cannot do anything useful with because they are all smaller >>> than the internal SSD page/block sizes and so get ignored. This is >>> one of the reasons fstrim is so much more efficient and effective >>> than using the discard mount option. >> In my use case, the files are fairly large, and there is constant >> rewriting (not in-place: files are read, merged, and written back). >> So I'm worried an fstrim can happen too late. > Have you measured the SSD performance degradation over time due to > large overwrites? If not, then again there is a good chance you are > trying to solve a theoretical problem rather than a real problem.... > I'm not worried about that (maybe I should be) but about the SSD reaching internal ENOSPC due to the fstrim happening too late. Consider this scenario, which is quite typical for us:

1. Fill 1/3rd of the disk with a few large files.
2. Copy/merge the data into a new file, occupying another 1/3rd of the disk.
3. Repeat 1+2.

If this is repeated a few times, the disk can see 100% of its space occupied (depending on how free space is allocated), even if from a user's perspective it is never more than 2/3rds full. Maybe a simple countermeasure is to issue an fstrim every time we write 10%-20% of the disk's capacity. 
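The fstrim-by-write-volume countermeasure suggested above might be sketched as a simple trigger. This is hypothetical: `trim_fn` stands in for running fstrim(8) on the mountpoint, presumably from the helper I/O thread so the reactor threads never block on it:

```python
class TrimPolicy:
    """Fire a trim once a given fraction of the device capacity has been
    written since the last trim (sketch of the countermeasure above)."""

    def __init__(self, capacity_bytes, fraction, trim_fn):
        # fraction would be something like 0.10-0.20 per the suggestion.
        self.threshold = int(capacity_bytes * fraction)
        self.written_since_trim = 0
        self.trim_fn = trim_fn

    def note_write(self, nbytes):
        # Account bytes written; once the threshold is crossed, issue the
        # trim and reset the counter.
        self.written_since_trim += nbytes
        if self.written_since_trim >= self.threshold:
            self.trim_fn()
            self.written_since_trim = 0
```

This keeps the drive's view of free space from lagging more than one threshold's worth behind the filesystem's, at the cost of periodic trim latency.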
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-02 9:02 ` Avi Kivity @ 2015-12-02 12:57 ` Carlos Maiolino 2015-12-02 23:19 ` Dave Chinner 1 sibling, 0 replies; 58+ messages in thread From: Carlos Maiolino @ 2015-12-02 12:57 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs > >>>>compatibility. > >>>> > >>>>>XFS spread files across the allocation groups, based on the directory these > >>>>>files are created, > >>>>Idea: create the files in some subdirectory, and immediately move > >>>>them to their required location. > >>>See xfs_fsr. > >>Can you elaborate? I don't see how it is applicable. > >Just pointing out that this is what xfs_fsr does to control locality > >of allocation for files it is defragmenting. Except that rather than > >moving files, it uses XFS_IOC_SWAPEXT to switch the data between two > >inodes atomically... > > Ok, thanks. > > > > >>My hack involves creating the file in a random directory, and while > >>it is still zero sized, move it to its final directory. This is > >>simply to defeat the ag selection heuristic. > >Which you really don't want to do. > > Why not? For my directory structure, files in the same directory do not > share temporal locality. What does the ag selection heuristic give me? > > > > > > >>>>> trying to keep files as close as possible from their > >>>>>metadata. > >>>>This is pointless for an SSD. Perhaps XFS should randomize the ag on > >>>>nonrotational media instead. > >>>Actually, no, it is not pointless. SSDs do not require optimisation > >>>for minimal seek time, but data locality is still just as important > >>>as spinning disks, if not moreso. Why? Because the garbage > >>>collection routines in the SSDs are all about locality and we can't > >>>drive garbage collection effectively via discard operations if the > >>>filesystem is not keeping temporally related files close together in > >>>it's block address space. > >>In my case, files in the same directory are not temporally related. 
> >>But I understand where the heuristic comes from. > >> > >>Maybe an ioctl to set a directory attribute "the files in this > >>directory are not temporally related"? > >And exactly what does that gain us? > > I have a directory with commitlog files that are constantly and rapidly > being created, appended to, and removed, from all logical cores in the > system. Does this not put pressure on that allocation group's locks? > > >Exactly what problem are you > >trying to solve by manipulating file locality that can't be solved > >by existing knobs and config options? > > I admit I don't know much about the existing knobs and config options. > Pointers are appreciated. > > > > > >Perhaps you'd like to read up on how the inode32 allocator behaves? > > Indeed I would, pointers are appreciated. > The inode32 mount option limits inode allocation to the AGs within the first 1TB of the filesystem, instead of allocating inodes across all AGs. This exists because inode numbers are derived from where the inode is allocated: inodes allocated beyond the first 1TB of the filesystem will have 64-bit inode numbers, which can cause compatibility problems with applications that are not able to handle 64-bit inode numbers. > > > >>>e.g. If the files in a directory are all close together, and the > >>>directory is removed, we then leave a big empty contiguous region in > >>>the filesystem free space map, and when we send discards over that > >>>we end up with a single big trim and the drive handles that far more > >>Would this not be defeated if a directory that happens to share the > >>allocation group gets populated simultaneously? > >Sure. But this sort of thing is rare in the real world, and when > >they do occur, it generally only takes small tweaks to algorithms > >and layouts to make them go away. I don't care to bikeshed about > >theoretical problems - I'm in the business of finding the root cause > >of the problems users are having and solving those problems.
So far > >what you've given us is a ball of "there's blocking in AIO > >submission", and the only one that is clear cut is the timestamp > >update. > > > >Go back and categorise the types of blocking that you are seeing - > >whether it be on the AGIs during inode manipulation, on the AGFs > >because of concurrent extent allocation, on log forces because of > >slow discards in the transaction completion, on the transaction > >subsystem because of a lack of log space for concurrent > >reservations, etc. And then determine if changing the layout of the > >filesystem (e.g. number of AGs, size of log, etc) and different > >mount options (e.g. turning off discard, using inode32 allocator, > >etc) make any difference to the blocking issues you are seeing. > > > >Once we know which of the different algorithms is causing the > >blocking issues, we'll know a lot more about why we're having > >problems and a better idea of what problems we actually need to > >solve. > > I'm happy to hack off the lowest hanging fruit and then go after the next > one. I understand you're annoyed at having to defend against what may be > non-problems; but for me it is an opportunity to learn about the file > system. For us it is the weakest spot in our system, because on the one > hand we heavily depend on async behavior and on the other hand Linux is > notoriously bad at it. So we are very nervous when blocking happens. > > > > >>>effectively than lots of little trims (i.e. one per file) that the > >>>drive cannot do anything useful with because they are all smaller > >>>than the internal SSD page/block sizes and so get ignored. This is > >>>one of the reasons fstrim is so much more efficient and effective > >>>than using the discard mount option. > >>In my use case, the files are fairly large, and there is constant > >>rewriting (not in-place: files are read, merged, and written back). > >>So I'm worried an fstrim can happen too late.
> >Have you measured the SSD performance degradation over time due to > >large overwrites? If not, then again it is a good chance you are > >trying to solve a theoretical problem rather than a real problem.... > > > > I'm not worried about that (maybe I should be) but about the SSD reaching > internal ENOSPC due to the fstrim happening too late. > > Consider this scenario, which is quite typical for us: > > 1. Fill 1/3rd of the disk with a few large files. > 2. Copy/merge the data into a new file, occupying another 1/3rd of the disk. > 3. Repeat 1+2. > > If this is repeated few times, the disk can see 100% of its space occupied > (depending on how free space is allocated), even if from a user's > perspective it is never more than 2/3rds full. > > Maybe a simple countermeasure is to issue an fstrim every time we write > 10%-20% of the disk's capacity. -- Carlos
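To make the inode32 explanation above concrete, here is some back-of-the-envelope arithmetic. This is only a sketch: it assumes 4 KiB filesystem blocks and 256-byte on-disk inodes (common mkfs.xfs values, not guaranteed), and it ignores the exact bit layout XFS uses inside an inode number.

```python
# Rough model: an XFS inode number encodes the inode's location on disk,
# so the further into the filesystem an inode is allocated, the larger
# its inode number. The sizes below are assumed defaults.
BLOCK_SIZE = 4096      # bytes per filesystem block (assumption)
INODE_SIZE = 256       # bytes per on-disk inode (assumption)

inodes_per_block = BLOCK_SIZE // INODE_SIZE          # 16
blocks_in_first_tib = (1 << 40) // BLOCK_SIZE        # 2**28

# A location-derived inode number for the first block past 1 TiB:
first_inum_past_1tib = blocks_in_first_tib * inodes_per_block

# Inode numbers from this point on no longer fit in 32 bits, which is
# why inode32 confines inode allocation to the low AGs.
print(first_inum_past_1tib >= (1 << 32))   # True
```

A different inode or block size moves the threshold around, but the shape of the argument stays the same.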
* Re: sleeps and waits during io_submit 2015-12-02 9:02 ` Avi Kivity 2015-12-02 12:57 ` Carlos Maiolino @ 2015-12-02 23:19 ` Dave Chinner 2015-12-03 12:52 ` Avi Kivity 1 sibling, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-12-02 23:19 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote: > On 12/02/2015 01:06 AM, Dave Chinner wrote: > >On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote: > >>On 12/01/2015 11:19 PM, Dave Chinner wrote: > >>>>>XFS spread files across the allocation groups, based on the directory these > >>>>>files are created, > >>>>Idea: create the files in some subdirectory, and immediately move > >>>>them to their required location. .... > >>My hack involves creating the file in a random directory, and while > >>it is still zero sized, move it to its final directory. This is > >>simply to defeat the ag selection heuristic. > >Which you really don't want to do. > > Why not? For my directory structure, files in the same directory do > not share temporal locality. What does the ag selection heuristic > give me? Wrong question. The right question is this: what problems does subverting the AG selection heuristic cause me? If you can't answer that question, then you can't quantify the risks involved with making such a behavioural change. > >>>>> trying to keep files as close as possible from their > >>>>>metadata. > >>>>This is pointless for an SSD. Perhaps XFS should randomize the ag on > >>>>nonrotational media instead. > >>>Actually, no, it is not pointless. SSDs do not require optimisation > >>>for minimal seek time, but data locality is still just as important > >>>as spinning disks, if not moreso. Why? 
Because the garbage > >>>collection routines in the SSDs are all about locality and we can't > >>>drive garbage collection effectively via discard operations if the > >>>filesystem is not keeping temporally related files close together in > >>>it's block address space. > >>In my case, files in the same directory are not temporally related. > >>But I understand where the heuristic comes from. > >> > >>Maybe an ioctl to set a directory attribute "the files in this > >>directory are not temporally related"? > >And exactly what does that gain us? > > I have a directory with commitlog files that are constantly and > rapidly being created, appended to, and removed, from all logical > cores in the system. Does this not put pressure on that allocation > group's locks? Not usually, because if an AG is contended, the allocation algorithm skips the contended AG and selects the next uncontended AG to allocate in. And given that the append algorithm used by the allocator attempts to use the last block of the last extent as the target for the new extent (i.e. contiguous allocation) once a file has skipped to a different AG all allocations will continue in that new AG until it is either full or it becomes contended.... IOWs, when AG contention occurs, the filesystem automatically spreads out the load over multiple AGs. Put simply, we optimise for locality first, but we're willing to compromise on locality to minimise contention when it occurs. But, also, keep in mind that in minimising contention we are still selecting the most local of possible alternatives, and that's something you can't do in userspace.... > >Exactly what problem are you > >trying to solve by manipulating file locality that can't be solved > >by existing knobs and config options? > > I admit I don't know much about the existing knobs and config > options. Pointers are appreciated. 
You can find some work in progress here: https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/ looks like there's some problem with xfs.org wiki, so the links to the user/training info on this page: http://xfs.org/index.php/XFS_Papers_and_Documentation aren't working. > >Perhaps you'd like to read up on how the inode32 allocator behaves? > > Indeed I would, pointers are appreciated. Inode allocation section here: https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc > >Once we know which of the different algorithms is causing the > >blocking issues, we'll know a lot more about why we're having > >problems and a better idea of what problems we actually need to > >solve. > > I'm happy to hack off the lowest hanging fruit and then go after the > next one. I understand you're annoyed at having to defend against > what may be non-problems; but for me it is an opportunity to learn > about the file system. No, I'm not annoyed. I just don't want to be chasing ghosts and so we need to be on the same page about how to track down these issues. And, believe me, you'll learn a lot about how the filesystem behaves just by watching how the different configs react to the same input... > For us it is the weakest spot in our system, > because on the one hand we heavily depend on async behavior and on > the other hand Linux is notoriously bad at it. So we are very > nervous when blocking happens. I can't disagree with you there - we really need to fix what we can within the constraints of the OS first, then, once we have it working as well as we can, we can look to solving the remaining "notoriously bad" AIO problems... > >>>effectively than lots of little trims (i.e. one per file) that the > >>>drive cannot do anything useful with because they are all smaller > >>>than the internal SSD page/block sizes and so get ignored.
This is > >>>one of the reasons fstrim is so much more efficient and effective > >>>than using the discard mount option. > >>In my use case, the files are fairly large, and there is constant > >>rewriting (not in-place: files are read, merged, and written back). > >>So I'm worried an fstrim can happen too late. > >Have you measured the SSD performance degradation over time due to > >large overwrites? If not, then again it is a good chance you are > >trying to solve a theoretical problem rather than a real problem.... > > > > I'm not worried about that (maybe I should be) but about the SSD > reaching internal ENOSPC due to the fstrim happening too late. > > Consider this scenario, which is quite typical for us: > > 1. Fill 1/3rd of the disk with a few large files. > 2. Copy/merge the data into a new file, occupying another 1/3rd of the disk. > 3. Repeat 1+2. > > If this is repeated few times, the disk can see 100% of its space > occupied (depending on how free space is allocated), even if from a > user's perspective it is never more than 2/3rds full. I don't think that's true. SSD behaviour largely depends on how much of the LBA space has been written to (i.e. marked used) and so that metric tends to determine how the SSD behaves under such workloads. This is one of the reasons that overprovisioning SSD space (e.g. leaving 25% of the LBA space completely unused) results in better performance under overwrite workloads - there's lots more scratch space for the garbage collector to work with... Hence as long as the filesystem is reusing the same LBA regions for the files, TRIM will probably not make a significant difference to performance because there's still 1/3rd of the LBA region that is "unused". Hence the overwrites go into the unused 1/3rd of the SSD, and the underlying SSD blocks associated with the "overwritten" LBA region are immediately marked free, just like if you issued a trim for that region before you start the overwrite. 
With the way the XFS allocator works, it fills AGs from lowest to highest blocks, and if you free lots of space down low in the AG then that tends to get reused before the higher offset free space. Hence the way XFS allocates space in the above workload would result in roughly 1/3rd of the LBA space associated with the filesystem remaining unused. This is another allocator behaviour designed for spinning disks (to keep the data on the faster outer edges of drives) that maps very well to internal SSD allocation/reclaim algorithms.... FWIW, did you know that TRIM generally doesn't return the disk to the performance of a pristine, empty disk? Generally only a secure erase will guarantee that a SSD returns to "empty disk" performance, but that also removes all data from the entire SSD. Hence the baseline "sustained performance" you should be using is not "empty disk" performance, but the performance once the disk has been overwritten completely at least once. Only then will you tend to see what effect TRIM will actually have. > Maybe a simple countermeasure is to issue an fstrim every time we > write 10%-20% of the disk's capacity. Run the workload to steady state performance and measure the degradation as it continues to run and overwrite the SSDs repeatedly. To do this properly you are going to have to sacrifice some SSDs, because you're going to need to overwrite them quite a few times to get an idea of the degradation characteristics and whether a periodic trim makes any difference or not. Cheers, Dave. -- Dave Chinner david@fromorbit.com
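Dave's point about lowest-first reuse can be illustrated with a toy model of the fill/copy/free scenario from earlier in the thread. This is not XFS code, just a sketch of a lowest-block-first allocator; the 300-block "disk" and 100-block "file" are arbitrary illustrative sizes.

```python
# Toy model: repeatedly write a 1/3-of-disk file, allocate a new 1/3
# file (the "merge" target), then free the old one. The allocator
# always hands out the lowest-numbered free blocks first.
CAPACITY = 300                       # blocks; each file is 100 blocks
free = list(range(CAPACITY))         # sorted free list, lowest first

def alloc(nblocks):
    """Take the lowest free blocks, like a lowest-first allocator."""
    got, free[:] = free[:nblocks], free[nblocks:]
    return got

def dealloc(blocks):
    free.extend(blocks)
    free.sort()

highest_touched = 0
old = alloc(100)                     # step 1: initial file
for _ in range(10):                  # steps 2+3, repeated
    new = alloc(100)                 # merge target
    highest_touched = max(highest_touched, max(old + new))
    dealloc(old)                     # old file is removed
    old = new

# Freed low blocks are reused, so the top third of the block space
# (blocks 200-299) is never written:
print(highest_touched)   # 199
```

In other words, as long as freed low space is reused first, the overwrite workload keeps cycling within the same ~2/3 of the LBA range, leaving the rest of the device's LBA space untouched for the SSD's garbage collector.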
* Re: sleeps and waits during io_submit 2015-12-02 23:19 ` Dave Chinner @ 2015-12-03 12:52 ` Avi Kivity 2015-12-04 3:16 ` Dave Chinner 0 siblings, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-03 12:52 UTC (permalink / raw) To: Dave Chinner; +Cc: Glauber Costa, xfs On 12/03/2015 01:19 AM, Dave Chinner wrote: > On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote: >> On 12/02/2015 01:06 AM, Dave Chinner wrote: >>> On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote: >>>> On 12/01/2015 11:19 PM, Dave Chinner wrote: >>>>>>> XFS spread files across the allocation groups, based on the directory these >>>>>>> files are created, >>>>>> Idea: create the files in some subdirectory, and immediately move >>>>>> them to their required location. > .... >>>> My hack involves creating the file in a random directory, and while >>>> it is still zero sized, move it to its final directory. This is >>>> simply to defeat the ag selection heuristic. >>> Which you really don't want to do. >> Why not? For my directory structure, files in the same directory do >> not share temporal locality. What does the ag selection heuristic >> give me? > Wrong question. The right question is this: what problems does > subverting the AG selection heuristic cause me? > > If you can't answer that question, then you can't quantify the risks > involved with making such a behavioural change. Okay. Any hint about the answer to that question? > >>>>>>> trying to keep files as close as possible from their >>>>>>> metadata. >>>>>> This is pointless for an SSD. Perhaps XFS should randomize the ag on >>>>>> nonrotational media instead. >>>>> Actually, no, it is not pointless. SSDs do not require optimisation >>>>> for minimal seek time, but data locality is still just as important >>>>> as spinning disks, if not moreso. Why? 
Because the garbage >>>>> collection routines in the SSDs are all about locality and we can't >>>>> drive garbage collection effectively via discard operations if the >>>>> filesystem is not keeping temporally related files close together in >>>>> it's block address space. >>>> In my case, files in the same directory are not temporally related. >>>> But I understand where the heuristic comes from. >>>> >>>> Maybe an ioctl to set a directory attribute "the files in this >>>> directory are not temporally related"? >>> And exactly what does that gain us? >> I have a directory with commitlog files that are constantly and >> rapidly being created, appended to, and removed, from all logical >> cores in the system. Does this not put pressure on that allocation >> group's locks? > Not usually, because if an AG is contended, the allocation algorithm > skips the contended AG and selects the next uncontended AG to > allocate in. And given that the append algorithm used by the > allocator attempts to use the last block of the last extent as the > target for the new extent (i.e. contiguous allocation) once a file > has skipped to a different AG all allocations will continue in that > new AG until it is either full or it becomes contended.... > > IOWs, when AG contention occurs, the filesystem automatically > spreads out the load over multiple AGs. Put simply, we optimise for > locality first, but we're willing to compromise on locality to > minimise contention when it occurs. But, also, keep in mind that > in minimising contention we are still selecting the most local of > possible alternatives, and that's something you can't do in > userspace.... Cool. I don't think "nearly-local" matters much for an SSD (it's either contiguous or it is not), but it's good to know that it's self-tuning wrt. contention. In some good news, Glauber hacked our I/O engine not to throw so many concurrent I/Os at the filesystem, and indeed the contention reduced.
So it's likely we were pushing the fs so hard all the ags were contended, but this is no longer the case. > >>> Exactly what problem are you >>> trying to solve by manipulating file locality that can't be solved >>> by existing knobs and config options? >> I admit I don't know much about the existing knobs and config >> options. Pointers are appreciated. > You can find some work in progress here: > > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/ > > looks like there's some problem with xfs.org wiki, so the links > to the user/training info on this page: > > http://xfs.org/index.php/XFS_Papers_and_Documentation > > aren't working. > >>> Perhaps you'd like to read up on how the inode32 allocator behaves? >> Indeed I would, pointers are appreciated. > Inode allocation section here: > > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc Thanks for all the links, I'll study them and see what we can do to tune for our workload. >>> Once we know which of the different algorithms is causing the >>> blocking issues, we'll know a lot more about why we're having >>> problems and a better idea of what problems we actually need to >>> solve. >> I'm happy to hack off the lowest hanging fruit and then go after the >> next one. I understand you're annoyed at having to defend against >> what may be non-problems; but for me it is an opportunity to learn >> about the file system. > No, I'm not annoyed. I just don't want to be chasing ghosts and so > we need to be on the same page about how to track down these issues. > And, beleive me, you'll learn a lot about how the filesystem behaves > just by watching how the different configs react to the same > input... Ok. Looks like I have a lot of homework. > >> For us it is the weakest spot in our system, >> because on the one hand we heavily depend on async behavior and on >> the other hand Linux is notoriously bad at it. 
So we are very >> nervous when blocking happens. > I can't disagree with you there - we really need to fix what we can > within the constraints of the OS first, then we once we have it > working as well as we can, then we can look to solving the remaining > "notoriously bad" AIO problems... There are lots of users who will be eternally grateful to you if you can get this fixed. Linux has a very bad reputation in this area with the accepted wisdom that you can only use aio reliably against block devices. XFS comes very close, it will make a huge impact if it can be used to do aio reliably, without a lot of constraints on the application. > >>>>> effectively than lots of little trims (i.e. one per file) that the >>>>> drive cannot do anything useful with because they are all smaller >>>>> than the internal SSD page/block sizes and so get ignored. This is >>>>> one of the reasons fstrim is so much more efficient and effective >>>>> than using the discard mount option. >>>> In my use case, the files are fairly large, and there is constant >>>> rewriting (not in-place: files are read, merged, and written back). >>>> So I'm worried an fstrim can happen too late. >>> Have you measured the SSD performance degradation over time due to >>> large overwrites? If not, then again it is a good chance you are >>> trying to solve a theoretical problem rather than a real problem.... >>> >> I'm not worried about that (maybe I should be) but about the SSD >> reaching internal ENOSPC due to the fstrim happening too late. >> >> Consider this scenario, which is quite typical for us: >> >> 1. Fill 1/3rd of the disk with a few large files. >> 2. Copy/merge the data into a new file, occupying another 1/3rd of the disk. >> 3. Repeat 1+2. >> >> If this is repeated few times, the disk can see 100% of its space >> occupied (depending on how free space is allocated), even if from a >> user's perspective it is never more than 2/3rds full. > I don't think that's true. 
SSD behaviour largely depends on how much > of the LBA space has been written to (i.e. marked used) and so that > metric tends to determine how the SSD behaves under such workloads. > This is one of the reasons that overprovisioning SSD space (e.g. > leaving 25% of the LBA space completely unused) results in better > performance under overwrite workloads - there's lots more scratch > space for the garbage collector to work with... > > Hence as long as the filesystem is reusing the same LBA regions for > the files, TRIM will probably not make a significant difference to > performance because there's still 1/3rd of the LBA region that is > "unused". Hence the overwrites go into the unused 1/3rd of the SSD, > and the underlying SSD blocks associated with the "overwritten" LBA > region are immediately marked free, just like if you issued a trim > for that region before you start the overwrite. > > With the way the XFS allocator works, it fills AGs from lowest to > highest blocks, and if you free lots of space down low in the AG > then that tends to get reused before the higher offset free space. > hence the XFS allocates space in the above workload would result in > roughly 1/3rd of the LBA space associated with the filesystem > remaining unused. This is another allocator behaviour designed for > spinning disks (to keep the data on the faster outer edges of > drives) that maps very well to internal SSD allocation/reclaim > algorithms.... Cool. So we'll keep fstrim usage to daily, or something similarly low. > > FWIW, did you know that TRIM generally doesn't return the disk to > the performance of a pristine, empty disk? Generally only a secure > erase will guarantee that a SSD returns to "empty disk" performance, > but that also removes all data from then entire SSD. Hence the > baseline "sustained performance" you should be using is not "empty > disk" performance, but the performance once the disk has been > overwritten completely at least once. 
Only then will you tend to see > what effect TRIM will actually have. I did not know that. Maybe that's another factor in why cloud SSDs are so slow. > >> Maybe a simple countermeasure is to issue an fstrim every time we >> write 10%-20% of the disk's capacity. > Run the workload to steady state performance and measure the > degradation as it continues to run and overwrite the SSDs > repeatedly. To do this properly you are going to have to sacrifice > some SSDs, because you're going to need to overwrite them quite a > few times to get an idea of the degradation characteristics and > whether a periodic trim makes any difference or not. Enterprise SSDs are guaranteed for something like N full writes / day for several years, are they not? So such a test can take weeks or months, depending on the ratio between disk size and bandwidth. Still, I guess it has to be done.
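For reference, the "stop throwing so many concurrent I/Os at the filesystem" change mentioned earlier in the thread can be sketched as a slot counter around submission. This is an illustrative sketch only, not the actual ScyllaDB engine (which is C++ and callback-driven); the class name and the cap of 128 are made up.

```python
import threading

class InflightLimiter:
    """Sketch of capping in-flight AIO: take a slot before io_submit(),
    give it back when the completion is reaped. Not a real I/O engine."""

    def __init__(self, max_inflight=128):
        # BoundedSemaphore also catches mismatched release() calls.
        self._slots = threading.BoundedSemaphore(max_inflight)

    def before_submit(self):
        # Called just before io_submit(). Any waiting happens here, in
        # application code where it can be scheduled around, instead of
        # inside the kernel where it cannot.
        self._slots.acquire()

    def after_completion(self):
        # Called once per event reaped via io_getevents().
        self._slots.release()
```

The design point is that the application chooses where blocking happens: with fewer concurrent requests in flight, fewer AGs end up contended at once.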
* Re: sleeps and waits during io_submit 2015-12-03 12:52 ` Avi Kivity @ 2015-12-04 3:16 ` Dave Chinner 2015-12-08 13:52 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-12-04 3:16 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Thu, Dec 03, 2015 at 02:52:08PM +0200, Avi Kivity wrote: > > > On 12/03/2015 01:19 AM, Dave Chinner wrote: > >On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote: > >>On 12/02/2015 01:06 AM, Dave Chinner wrote: > >>>On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote: > >>>>On 12/01/2015 11:19 PM, Dave Chinner wrote: > >>>>>>>XFS spread files across the allocation groups, based on the directory these > >>>>>>>files are created, > >>>>>>Idea: create the files in some subdirectory, and immediately move > >>>>>>them to their required location. > >.... > >>>>My hack involves creating the file in a random directory, and while > >>>>it is still zero sized, move it to its final directory. This is > >>>>simply to defeat the ag selection heuristic. > >>>Which you really don't want to do. > >>Why not? For my directory structure, files in the same directory do > >>not share temporal locality. What does the ag selection heuristic > >>give me? > >Wrong question. The right question is this: what problems does > >subverting the AG selection heuristic cause me? > > > >If you can't answer that question, then you can't quantify the risks > >involved with making such a behavioural change. > > Okay. Any hint about the answer to that question? If your file set is randomly distributed across the filesystem, then it's quite likely that the filesystem will use all of the LBA space rather than reusing the same AGs and hence LBA regions. That's going to slowly fragment free space as metadata (which has different lifetimes to data) and long term data gets more widely distributed. 
That, in term will slowly result in the working dataset being made up of more and smaller extents, whcih will also slowly get more distributed over time, which them means allocation and freeing of extents takes longer, trim becomes less effective because it's workingwith smaller spaces, the SSD's "LBA in use" mapping becomes more fragmented so garbage collection becomes harder, etc... But, really, the only way to tell is to test, measure, observe and analyse.... > >>>>>>This is pointless for an SSD. Perhaps XFS should randomize the ag on > >>>>>>nonrotational media instead. > >>>>>Actually, no, it is not pointless. SSDs do not require optimisation > >>>>>for minimal seek time, but data locality is still just as important > >>>>>as spinning disks, if not moreso. Why? Because the garbage > >>>>>collection routines in the SSDs are all about locality and we can't > >>>>>drive garbage collection effectively via discard operations if the > >>>>>filesystem is not keeping temporally related files close together in > >>>>>it's block address space. > >>>>In my case, files in the same directory are not temporally related. > >>>>But I understand where the heuristic comes from. > >>>> > >>>>Maybe an ioctl to set a directory attribute "the files in this > >>>>directory are not temporally related"? > >>>And exactly what does that gain us? > >>I have a directory with commitlog files that are constantly and > >>rapidly being created, appended to, and removed, from all logical > >>cores in the system. Does this not put pressure on that allocation > >>group's locks? > >Not usually, because if an AG is contended, the allocation algorithm > >skips the contended AG and selects the next uncontended AG to > >allocate in. And given that the append algorithm used by the > >allocator attempts to use the last block of the last extent as the > >target for the new extent (i.e. 
contiguous allocation) once a file > >has skipped to a different AG all allocations will continue in that > >new AG until it is either full or it becomes contended.... > > > >IOWs, when AG contention occurs, the filesystem automatically > >spreads out the load over multiple AGs. Put simply, we optimise for > >locality first, but we're willing to compromise on locality to > >minimise contention when it occurs. But, also, keep in mind that > >in minimising contention we are still selecting the most local of > >possible alternatives, and that's something you can't do in > >userspace.... > > Cool. I don't think "nearly-local" matters much for an SSD (it's > either contiguous or it is not), but it's good to know that it's > self-tuning wrt. contention. "Nearly local" matters a lot for filesystem free space management and hence minimising the amount o LBA space the filesystem actually uses in the long term given a relatively predicatable workload.... > In some good news, Glauber hacked our I/O engine not to throw so > many concurrent I/Os at the filesystem, and indeed so the contention > reduced. So it's likely we were pushing the fs so hard all the ags > were contended, but this is no longer the case. What is the xfs_info output of the filesystem you tested on? > >With the way the XFS allocator works, it fills AGs from lowest to > >highest blocks, and if you free lots of space down low in the AG > >then that tends to get reused before the higher offset free space. > >hence the XFS allocates space in the above workload would result in > >roughly 1/3rd of the LBA space associated with the filesystem > >remaining unused. This is another allocator behaviour designed for > >spinning disks (to keep the data on the faster outer edges of > >drives) that maps very well to internal SSD allocation/reclaim > >algorithms.... > > Cool. So we'll keep fstrim usage to daily, or something similarly low. 
Well, it's something you'll need to monitor to determine what the best frequency is, as even fstrim doesn't come for free (esp. if the storage does not support queued TRIM commands). > >FWIW, did you know that TRIM generally doesn't return the disk to > >the performance of a pristine, empty disk? Generally only a secure > >erase will guarantee that a SSD returns to "empty disk" performance, > >but that also removes all data from then entire SSD. Hence the > >baseline "sustained performance" you should be using is not "empty > >disk" performance, but the performance once the disk has been > >overwritten completely at least once. Only them will you tend to see > >what effect TRIM will actually have. > > I did not know that. Maybe that's another factor in why cloud SSDs > are so slow. Have a look at the random write performance consistency graphs for the different enterprise SSDs here: http://www.anandtech.com/show/9430/micron-m510dc-480gb-enterprise-sata-ssd-review/3 You'll see just how different sustained write load performance is to the empty drive performance (which is only the first few hundred seconds of each graph) across the different drives that have been tested. The next page has similar results for mixed random read/write workloads.... That will give you a good idea of how the current enterprise SSDs behave under sustained write load. It's a *lot* better than the way the 1st and 2nd generation drives performed.... > >>write 10%-20% of the disk's capacity. > >Run the workload to steady state performance and measure the > >degradation as it continues to run and overwrite the SSDs > >repeatedly. To do this properly you are going to have to sacrifice > >some SSDs, because you're going to need to overwrite them quite a > >few times to get an idea of the degradation characteristics and > >whether a periodic trim makes any difference or not. > > Enterprise SSDs are guaranteed for something like N full writes / > day for several years, are they not? 
Yes, usually somewhere between 3-15 DWPD (Drive Writes Per Day) - it typically works out at around 5000 full drive write cycles for enterprise drives. However, at both the low capacity end of the scale or the high performance end (i.e. pcie cards capable of multiple GB/s writes), it's not uncommon to be able to burn a DW cycle in under 10 minutes and so you can easily burn the life out of a drive in a couple of weeks of intense testing.... > So such a test can take weeks > or months, depending on the ratio between disk size and bandwidth. > Still, I guess it has to be done. *nod* Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-04 3:16 ` Dave Chinner @ 2015-12-08 13:52 ` Avi Kivity 2015-12-08 23:13 ` Dave Chinner 0 siblings, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-08 13:52 UTC (permalink / raw) To: Dave Chinner; +Cc: Glauber Costa, xfs On 12/04/2015 05:16 AM, Dave Chinner wrote: > On Thu, Dec 03, 2015 at 02:52:08PM +0200, Avi Kivity wrote: >> >> On 12/03/2015 01:19 AM, Dave Chinner wrote: >>> On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote: >>>> On 12/02/2015 01:06 AM, Dave Chinner wrote: >>>>> On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote: >>>>>> On 12/01/2015 11:19 PM, Dave Chinner wrote: >>>>>>>>> XFS spread files across the allocation groups, based on the directory these >>>>>>>>> files are created, >>>>>>>> Idea: create the files in some subdirectory, and immediately move >>>>>>>> them to their required location. >>> .... >>>>>> My hack involves creating the file in a random directory, and while >>>>>> it is still zero sized, move it to its final directory. This is >>>>>> simply to defeat the ag selection heuristic. >>>>> Which you really don't want to do. >>>> Why not? For my directory structure, files in the same directory do >>>> not share temporal locality. What does the ag selection heuristic >>>> give me? >>> Wrong question. The right question is this: what problems does >>> subverting the AG selection heuristic cause me? >>> >>> If you can't answer that question, then you can't quantify the risks >>> involved with making such a behavioural change. >> Okay. Any hint about the answer to that question? > If your file set is randomly distributed across the filesystem, I think that happens whether or not I break the "files in the same directory are related" heuristic, because I have many directories. It's just that some of them get churned more than others. > then > it's quite likely that the filesystem will use all of the LBA space > rather than reusing the same AGs and hence LBA regions. 
That's going > to slowly fragment free space as metadata (which has different > lifetimes to data) and long term data gets more widely distributed. > That, in turn, will slowly result in the working dataset being made > up of more and smaller extents, which will also slowly get more > distributed over time, which then means allocation and freeing of > extents takes longer, trim becomes less effective because it's > working with smaller spaces, the SSD's "LBA in use" mapping becomes > more fragmented so garbage collection becomes harder, etc... > > But, really, the only way to tell is to test, measure, observe and > analyse.... Sure. > >>>>>>>> This is pointless for an SSD. Perhaps XFS should randomize the ag on >>>>>>>> nonrotational media instead. >>>>>>> Actually, no, it is not pointless. SSDs do not require optimisation >>>>>>> for minimal seek time, but data locality is still just as important >>>>>>> as spinning disks, if not moreso. Why? Because the garbage >>>>>>> collection routines in the SSDs are all about locality and we can't >>>>>>> drive garbage collection effectively via discard operations if the >>>>>>> filesystem is not keeping temporally related files close together in >>>>>>> its block address space. >>>>>> In my case, files in the same directory are not temporally related. >>>>>> But I understand where the heuristic comes from. >>>>>> >>>>>> Maybe an ioctl to set a directory attribute "the files in this >>>>>> directory are not temporally related"? >>>>> And exactly what does that gain us? >>>> I have a directory with commitlog files that are constantly and >>>> rapidly being created, appended to, and removed, from all logical >>>> cores in the system. Does this not put pressure on that allocation >>>> group's locks? >>> Not usually, because if an AG is contended, the allocation algorithm >>> skips the contended AG and selects the next uncontended AG to >>> allocate in. 
And given that the append algorithm used by the >>> allocator attempts to use the last block of the last extent as the >>> target for the new extent (i.e. contiguous allocation) once a file >>> has skipped to a different AG all allocations will continue in that >>> new AG until it is either full or it becomes contended.... >>> >>> IOWs, when AG contention occurs, the filesystem automatically >>> spreads out the load over multiple AGs. Put simply, we optimise for >>> locality first, but we're willing to compromise on locality to >>> minimise contention when it occurs. But, also, keep in mind that >>> in minimising contention we are still selecting the most local of >>> possible alternatives, and that's something you can't do in >>> userspace.... >> Cool. I don't think "nearly-local" matters much for an SSD (it's >> either contiguous or it is not), but it's good to know that it's >> self-tuning wrt. contention. > "Nearly local" matters a lot for filesystem free space management > and hence minimising the amount of LBA space the filesystem actually > uses in the long term given a relatively predictable workload.... > >> In some good news, Glauber hacked our I/O engine not to throw so >> many concurrent I/Os at the filesystem, and indeed the contention >> reduced. So it's likely we were pushing the fs so hard all the ags >> were contended, but this is no longer the case. > What is the xfs_info output of the filesystem you tested on? It was a cloud disk so someone else now has the pleasure... > >>> With the way the XFS allocator works, it fills AGs from lowest to >>> highest blocks, and if you free lots of space down low in the AG >>> then that tends to get reused before the higher offset free space. >>> hence the way XFS allocates space in the above workload would result in >>> roughly 1/3rd of the LBA space associated with the filesystem >>> remaining unused. 
This is another allocator behaviour designed for >>> spinning disks (to keep the data on the faster outer edges of >>> drives) that maps very well to internal SSD allocation/reclaim >>> algorithms.... >> Cool. So we'll keep fstrim usage to daily, or something similarly low. > Well, it's something you'll need to monitor to determine what the > best frequency is, as even fstrim doesn't come for free (esp. if the > storage does not support queued TRIM commands). I was able to trigger a load where discard caused io_submit to sleep even on my super-fast nvme drive. The bad news is, disabling discard and running fstrim in parallel with this load also caused io_submit to sleep. > >>> FWIW, did you know that TRIM generally doesn't return the disk to >>> the performance of a pristine, empty disk? Generally only a secure >>> erase will guarantee that a SSD returns to "empty disk" performance, >>> but that also removes all data from the entire SSD. Hence the >>> baseline "sustained performance" you should be using is not "empty >>> disk" performance, but the performance once the disk has been >>> overwritten completely at least once. Only then will you tend to see >>> what effect TRIM will actually have. >> I did not know that. Maybe that's another factor in why cloud SSDs >> are so slow. > Have a look at the random write performance consistency graphs for > the different enterprise SSDs here: > > http://www.anandtech.com/show/9430/micron-m510dc-480gb-enterprise-sata-ssd-review/3 > > You'll see just how different sustained write load performance is to > the empty drive performance (which is only the first few hundred > seconds of each graph) across the different drives that have been > tested. The next page has similar results for mixed random > read/write workloads.... > > That will give you a good idea of how the current enterprise SSDs > behave under sustained write load. It's a *lot* better than the way > the 1st and 2nd generation drives performed.... 
> >>>> write 10%-20% of the disk's capacity. >>> Run the workload to steady state performance and measure the >>> degradation as it continues to run and overwrite the SSDs >>> repeatedly. To do this properly you are going to have to sacrifice >>> some SSDs, because you're going to need to overwrite them quite a >>> few times to get an idea of the degradation characteristics and >>> whether a periodic trim makes any difference or not. >> Enterprise SSDs are guaranteed for something like N full writes / >> day for several years, are they not? > Yes, usually somewhere between 3-15 DWPD (Drive Writes Per Day) - it > typically works out at around 5000 full drive write cycles for > enterprise drives. However, at both the low capacity end of the > scale or the high performance end (i.e. pcie cards capable of multiple > GB/s writes), it's not uncommon to be able to burn a DW cycle in > under 10 minutes and so you can easily burn the life out of a drive > in a couple of weeks of intense testing.... > >> So such a test can take weeks >> or months, depending on the ratio between disk size and bandwidth. >> Still, I guess it has to be done. > *nod* > > Cheers, > > Dave. 
* Re: sleeps and waits during io_submit 2015-12-08 13:52 ` Avi Kivity @ 2015-12-08 23:13 ` Dave Chinner 0 siblings, 0 replies; 58+ messages in thread From: Dave Chinner @ 2015-12-08 23:13 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Tue, Dec 08, 2015 at 03:52:52PM +0200, Avi Kivity wrote: > >>>With the way the XFS allocator works, it fills AGs from lowest to > >>>highest blocks, and if you free lots of space down low in the AG > >>>then that tends to get reused before the higher offset free space. > >>>hence the way XFS allocates space in the above workload would result in > >>>roughly 1/3rd of the LBA space associated with the filesystem > >>>remaining unused. This is another allocator behaviour designed for > >>>spinning disks (to keep the data on the faster outer edges of > >>>drives) that maps very well to internal SSD allocation/reclaim > >>>algorithms.... > >>Cool. So we'll keep fstrim usage to daily, or something similarly low. > >Well, it's something you'll need to monitor to determine what the > >best frequency is, as even fstrim doesn't come for free (esp. if the > >storage does not support queued TRIM commands). > > I was able to trigger a load where discard caused io_submit to sleep > even on my super-fast nvme drive. > > The bad news is, disabling discard and running fstrim in parallel > with this load also caused io_submit to sleep. Well, yes. fstrim is not a magic bullet that /prevents/ discard from interrupting your application's IO - it's just a method under which the impact can be /somewhat controlled/ as it can be scheduled for periods where the impact has minimal interruption (e.g. when load is likely to be light, such as at 3am just before nightly backups are run). Regardless, it sounds like your steady state load could be described as "throwing as much IO as we possibly can at the device", but you are then having "blocking trouble" when maintenance (expensive) operations like TRIM need to be run. 
I'm not sure this "blocking" can be prevented completely, because it assumes that you have a device of infinite IO capacity. That is, if you exceed the device's command queue depth and the IO scheduler request queue depth, the block layer will block in the IO scheduler waiting for a request queue slot to come free. Put simply: if you overload the IO subsystem, it will block. There's nothing we can do in the filesystem about this - this is the way the block layer works, and it's architected this way to provide the necessary feedback control for buffered write IO throttling and other congestion control mechanisms in the kernel. Sure, you can set the IO scheduler request queue depth to be really deep to avoid blocking, but this then simply increases your average and worst-case IO latency in overload situations. At some point you have to consider the IO subsystem is overloaded and the application driving it needs to back off. Something has to block when this happens... Cheers, Dave. -- Dave Chinner david@fromorbit.com 
* Re: sleeps and waits during io_submit 2015-12-01 17:09 ` Avi Kivity 2015-12-01 18:03 ` Carlos Maiolino @ 2015-12-01 18:51 ` Brian Foster 2015-12-01 19:07 ` Glauber Costa 2015-12-01 19:26 ` Avi Kivity 1 sibling, 2 replies; 58+ messages in thread From: Brian Foster @ 2015-12-01 18:51 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote: > > > On 12/01/2015 06:29 PM, Brian Foster wrote: > >On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote: > >> > >>On 12/01/2015 06:01 PM, Brian Foster wrote: > >>>On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: > >>>>On 12/01/2015 04:56 PM, Brian Foster wrote: > >>>>>On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: > >>>>>>On 12/01/2015 03:11 PM, Brian Foster wrote: > >>>>>>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: > >>>>>>>>On 11/30/2015 06:14 PM, Brian Foster wrote: > >>>>>>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: > >>>>>>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote: > >>>>>>>... ... > >>>>>>Won't io_submit() also trigger metadata I/O? Or is that all deferred to > >>>>>>async tasks? I don't mind them blocking each other as long as they let my > >>>>>>io_submit alone. > >>>>>> > >>>>>Yeah, it can trigger metadata reads, force the log (the stale buffer > >>>>>example) or push the AIL (wait on log space). Metadata changes made > >>>>>directly via your I/O request are logged/committed via transactions, > >>>>>which are generally processed asynchronously from that point on. > >>>>> > >>>>>>> io_submit() can probably block in a variety of > >>>>>>>places afaict... it might have to read in the inode extent map, allocate > >>>>>>>blocks, take inode/ag locks, reserve log space for transactions, etc. > >>>>>>Any chance of changing all that to be asynchronous? Doesn't sound too hard, > >>>>>>if somebody else has to do it. > >>>>>> > >>>>>I'm not following... 
if the fs needs to read in the inode extent map to > >>>>>prepare for an allocation, what else can the thread do but wait? Are you > >>>>>suggesting the request kick off whatever the blocking action happens to > >>>>>be asynchronously and return with an error such that the request can be > >>>>>retried later? > >>>>Not quite, it should be invisible to the caller. > >>>> > >>>>That is, the code called by io_submit() (file_operations::write_iter, it > >>>>seems to be called today) can kick off this operation and have it continue > >>>>from where it left off. > >>>Isn't that generally what happens today? > >>You tell me. According to $subject, apparently not enough. Maybe we're > >>triggering it more often, or we suffer more when it does trigger (the latter > >>probably more likely). > >> > >The original mail describes looking at the sched:sched_switch tracepoint > >which on a quick look, appears to fire whenever a cpu context switch > >occurs. This likely triggers any time we wait on an I/O or a contended > >lock (among other situations I'm sure), and it signifies that something > >else is going to execute in our place until this thread can make > >progress. > > For us, nothing else can execute in our place, we usually have exactly one > thread per logical core. So we are heavily dependent on io_submit not > sleeping. > Yes, this "coroutine model" makes more sense to me from the application perspective. I'm just trying to understand what you're after from the kernel perspective. > The case of a contended lock is, to me, less worrying. It can be reduced by > using more allocation groups, which is apparently the shared resource under > contention. > Yep. > The case of waiting for I/O is much more worrying, because I/O latency are > much higher. But it seems like most of the DIO path does not trigger > locking around I/O (and we are careful to avoid the ones that do, like > writing beyond eof). 
> > (sorry for repeating myself, I have the feeling we are talking past each > other and want to be on the same page) > Yeah, my point is just that just because the thread blocked on I/O, doesn't mean the cpu can't carry on with some useful work for another task. > > > >>> We submit an I/O which is > >>>asynchronous in nature and wait on a completion, which causes the cpu to > >>>schedule and execute another task until the completion is set by I/O > >>>completion (via an async callback). At that point, the issuing thread > >>>continues where it left off. I suspect I'm missing something... can you > >>>elaborate on what you'd do differently here (and how it helps)? > >>Just apply the same technique everywhere: convert locks to trylock + > >>schedule a continuation on failure. > >> > >I'm certainly not an expert on the kernel scheduling, locking and > >serialization mechanisms, but my understanding is that most things > >outside of spin locks are reschedule points. For example, the > >wait_for_completion() calls XFS uses to wait on I/O boil down to > >schedule_timeout() calls. Buffer locks are implemented as semaphores and > >down() can end up in the same place. > > But, for the most part, XFS seems to be able to avoid sleeping. The call to > __blockdev_direct_IO only launches the I/O, so any locking is only around > cpu operations and, unless there is contention, won't cause us to sleep in > io_submit(). > > Trying to follow the code, it looks like xfs_get_blocks_direct (and > __blockdev_direct_IO's get_block parameter in general) is synchronous, so > we're just lucky to have everything in cache. If it isn't, we block right > there. I really hope I'm misreading this and some other magic is happening > elsewhere instead of this. > Nope, it's synchronous from a code perspective. The xfs_bmapi_read()->xfs_iread_extents() path could have to read in the inode bmap metadata if it hasn't been done already. 
Note that this should only happen once as everything is stored in-core, so in most cases this is skipped. It's also possible extents are read in via some other path/operation on the inode before an async I/O happens to be submitted (e.g., see some of the other xfs_bmapi_read() callers). Either way, the extents have to be read in at some point and I'd expect that cpu to schedule onto some other task while that thread waits on I/O to complete (read-ahead could also be a factor here, but I haven't really dug into how that is triggered for buffers). Brian > >Brian > > > >>>>Seastar (the async user framework which we use to drive xfs) makes writing > >>>>code like this easy, using continuations; but of course from ordinary > >>>>threaded code it can be quite hard. > >>>> > >>>>btw, there was an attempt to make ext[34] async using this method, but I > >>>>think it was ripped out. Yes, the mortal remains can still be seen with > >>>>'git grep EIOCBQUEUED'. > >>>> > >>>>>>>It sounds to me that first and foremost you want to make sure you don't > >>>>>>>have however many parallel operations you typically have running > >>>>>>>contending on the same inodes or AGs. Hint: creating files under > >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under > >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode > >>>>>>>number). > >>>>>>Unfortunately our directory layout cannot be changed. And doesn't this > >>>>>>require having agcount == O(number of active files)? That is easily in the > >>>>>>thousands. > >>>>>> > >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely > >>>>>ballpark, but really it's something you'll probably just need to test to > >>>>>see how far you need to go to avoid AG contention. > >>>>> > >>>>>I'm primarily throwing the subdir thing out there for testing purposes. 
> >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you > >>>>>can determine whether/how much it really helps with modified AG counts. > >>>>>I don't know enough about your application design to really comment on > >>>>>that... > >>>>We have O(cpus) shards that operate independently. Each shard writes 32MB > >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes > >>>>without blocking); the files are then flushed and closed, and later removed. > >>>>In parallel there are sequential writes and reads of large files (using 128kB > >>>>buffers), as well as random reads. Files are immutable (append-only), and > >>>>if a file is being written, it is not concurrently read. In general files > >>>>are not shared across shards. All I/O is async and O_DIRECT. open(), > >>>>truncate(), fdatasync(), and friends are called from a helper thread. > >>>> > >>>>As far as I can tell it should be a very friendly load for XFS and SSDs. > >>>> > >>>>>>> Reducing the frequency of block allocation/frees might also be > >>>>>>>another help (e.g., preallocate and reuse files, > >>>>>>Isn't that discouraged for SSDs? > >>>>>> > >>>>>Perhaps, if you're referring to the fact that the blocks are never freed > >>>>>and thus never discarded..? Are you running fstrim? > >>>>mount -o discard. And yes, overwrites are supposedly more expensive than > >>>>trim old data + allocate new data, but maybe if you compare it with the work > >>>>XFS has to do, perhaps the tradeoff is bad. > >>>> > >>>Ok, my understanding is that '-o discard' is not recommended in favor of > >>>periodic fstrim for performance reasons, but that may or may not still > >>>be the case. > >>I understand that most SSDs have queued trim these days, but maybe I'm > >>optimistic. > >> > 
* Re: sleeps and waits during io_submit 2015-12-01 18:51 ` Brian Foster @ 2015-12-01 19:07 ` Glauber Costa 2015-12-01 19:35 ` Brian Foster 2015-12-01 19:26 ` Avi Kivity 1 sibling, 1 reply; 58+ messages in thread From: Glauber Costa @ 2015-12-01 19:07 UTC (permalink / raw) To: Brian Foster; +Cc: Avi Kivity, xfs Hi Brian, > > Either way, the extents have to be read in at some point and I'd expect > that cpu to schedule onto some other task while that thread waits on I/O > to complete (read-ahead could also be a factor here, but I haven't > really dug into how that is triggered for buffers). > Being a datastore, we expect to run practically alone in any box we're at. That means that there is no other task to run. If io_submit blocks, the system blocks. The assumption that blocking will just yield the processor for another thread makes sense in the general case where you assume more than one application running and/or more than one thread within the same application. From our user's perspective, however, every time that happens we can't make progress. It doesn't really matter where it blocks. If io_submit returns without blocking, we can still push more work, even though the kernel is still not ready to proceed. If it blocks, we're dead. > Brian > >> >Brian >> > >> >>>>Seastar (the async user framework which we use to drive xfs) makes writing >> >>>>code like this easy, using continuations; but of course from ordinary >> >>>>threaded code it can be quite hard. >> >>>> >> >>>>btw, there was an attempt to make ext[34] async using this method, but I >> >>>>think it was ripped out. Yes, the mortal remains can still be seen with >> >>>>'git grep EIOCBQUEUED'. >> >>>> >> >>>>>>>It sounds to me that first and foremost you want to make sure you don't >> >>>>>>>have however many parallel operations you typically have running >> >>>>>>>contending on the same inodes or AGs. 
Hint: creating files under >> >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under >> >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode >> >>>>>>>number). >> >>>>>>Unfortunately our directory layout cannot be changed. And doesn't this >> >>>>>>require having agcount == O(number of active files)? That is easily in the >> >>>>>>thousands. >> >>>>>> >> >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely >> >>>>>ballpark, but really it's something you'll probably just need to test to >> >>>>>see how far you need to go to avoid AG contention. >> >>>>> >> >>>>>I'm primarily throwing the subdir thing out there for testing purposes. >> >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you >> >>>>>can determine whether/how much it really helps with modified AG counts. >> >>>>>I don't know enough about your application design to really comment on >> >>>>>that... >> >>>>We have O(cpus) shards that operate independently. Each shard writes 32MB >> >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes >> >>>>without blocking); the files are then flushed and closed, and later removed. >> >>>>In parallel there are sequential writes and reads of large files (using 128kB >> >>>>buffers), as well as random reads. Files are immutable (append-only), and >> >>>>if a file is being written, it is not concurrently read. In general files >> >>>>are not shared across shards. All I/O is async and O_DIRECT. open(), >> >>>>truncate(), fdatasync(), and friends are called from a helper thread. >> >>>> >> >>>>As far as I can tell it should be a very friendly load for XFS and SSDs. >> >>>> >> >>>>>>> Reducing the frequency of block allocation/frees might also be >> >>>>>>>another help (e.g., preallocate and reuse files, >> >>>>>>Isn't that discouraged for SSDs? >> >>>>>> >> >>>>>Perhaps, if you're referring to the fact that the blocks are never freed >> >>>>>and thus never discarded..? 
Are you running fstrim? >> >>>>mount -o discard. And yes, overwrites are supposedly more expensive than >> >>>>trim old data + allocate new data, but maybe if you compare it with the work >> >>>>XFS has to do, perhaps the tradeoff is bad. >> >>>> >> >>>Ok, my understanding is that '-o discard' is not recommended in favor of >> >>>periodic fstrim for performance reasons, but that may or may not still >> >>>be the case. >> >>I understand that most SSDs have queued trim these days, but maybe I'm >> >>optimistic. >> >> >> 
* Re: sleeps and waits during io_submit 2015-12-01 19:07 ` Glauber Costa @ 2015-12-01 19:35 ` Brian Foster 2015-12-01 19:45 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Brian Foster @ 2015-12-01 19:35 UTC (permalink / raw) To: Glauber Costa; +Cc: Avi Kivity, xfs On Tue, Dec 01, 2015 at 02:07:41PM -0500, Glauber Costa wrote: > Hi Brian, > > > > > > Either way, the extents have to be read in at some point and I'd expect > > that cpu to schedule onto some other task while that thread waits on I/O > > to complete (read-ahead could also be a factor here, but I haven't > > really dug into how that is triggered for buffers). > > > > > Being a datastore, we expect to run practically alone in any box we're > at. That means that there is no other task to run. If io_submit > blocks, the system blocks. The assumption that blocking will just > yield the processor for another thread makes sense in the general case > where you assume more than one application running and/or more than > one thread within the same application. > Hmm, well that helps me understand the concern a bit more. That said, I still question how likely this condition is. Even if this is a completely stripped down userspace with no other applications running, the kernel (or even XFS) alone might have plenty of threads/work items to execute to take care of "background" tasks for various subsystems. Of course, we don't have all of the details of your environment so perhaps this is not the case. Perhaps a more productive approach here might be to find a way to detect this particular case (once you've worked out the other AG count tunings and whatnot that you want to use) where a thread into the fs is blocked and actually has nothing else to do and work from there. I _think_ there is such a thing as an idle task somewhere that might be useful to help quantify this, but I'd have to dig around to understand it better. 
That actually gives us a concrete scenario to work with, try to reproduce and improve on. It also facilitates improvements that might be beneficial to the general use case as opposed to tailored for this particular use case and highly specific environment. For example, if we find a particular sustained workload that repetitively blocks with nothing else to do, document and characterize it for the list and I'm sure people will come up with a variety of ideas to try and address it. Otherwise, we're kind of just looking around for context switch points and assuming that they will all just block with nothing else to do. For one, I don't think that's really accurate. It's also not very productive an approach and doesn't have any measurable benefit if it doesn't come along with a test case or reproducible condition. Brian > From our user's perspective, however, every time that happens we can't > make progress. It doesn't really matter where it blocks. > > If io_submit returns without blocking, we can still push more work, > even though the kernel is still not ready to proceed. If it blocks, > we're dead. > > > Brian > > > >> >Brian > >> > > >> >>>>Seastar (the async user framework which we use to drive xfs) makes writing > >> >>>>code like this easy, using continuations; but of course from ordinary > >> >>>>threaded code it can be quite hard. > >> >>>> > >> >>>>btw, there was an attempt to make ext[34] async using this method, but I > >> >>>>think it was ripped out. Yes, the mortal remains can still be seen with > >> >>>>'git grep EIOCBQUEUED'. > >> >>>> > >> >>>>>>>It sounds to me that first and foremost you want to make sure you don't > >> >>>>>>>have however many parallel operations you typically have running > >> >>>>>>>contending on the same inodes or AGs. Hint: creating files under > >> >>>>>>>separate subdirectories is a quick and easy way to allocate inodes under > >> >>>>>>>separate AGs (the agno is encoded into the upper bits of the inode > >> >>>>>>>number). 
> >> >>>>>>Unfortunately our directory layout cannot be changed. And doesn't this > >> >>>>>>require having agcount == O(number of active files)? That is easily in the > >> >>>>>>thousands. > >> >>>>>> > >> >>>>>I think Glauber's O(nr_cpus) comment is probably the more likely > >> >>>>>ballpark, but really it's something you'll probably just need to test to > >> >>>>>see how far you need to go to avoid AG contention. > >> >>>>> > >> >>>>>I'm primarily throwing the subdir thing out there for testing purposes. > >> >>>>>It's just an easy way to create inodes in a bunch of separate AGs so you > >> >>>>>can determine whether/how much it really helps with modified AG counts. > >> >>>>>I don't know enough about your application design to really comment on > >> >>>>>that... > >> >>>>We have O(cpus) shards that operate independently. Each shard writes 32MB > >> >>>>commitlog files (that are pre-truncated to 32MB to allow concurrent writes > >> >>>>without blocking); the files are then flushed and closed, and later removed. > >> >>>>In parallel there are sequential writes and reads of large files (using 128kB > >> >>>>buffers), as well as random reads. Files are immutable (append-only), and > >> >>>>if a file is being written, it is not concurrently read. In general files > >> >>>>are not shared across shards. All I/O is async and O_DIRECT. open(), > >> >>>>truncate(), fdatasync(), and friends are called from a helper thread. > >> >>>> > >> >>>>As far as I can tell it should be a very friendly load for XFS and SSDs. > >> >>>> > >> >>>>>>> Reducing the frequency of block allocation/frees might also be > >> >>>>>>>another help (e.g., preallocate and reuse files, > >> >>>>>>Isn't that discouraged for SSDs? > >> >>>>>> > >> >>>>>Perhaps, if you're referring to the fact that the blocks are never freed > >> >>>>>and thus never discarded..? Are you running fstrim? > >> >>>>mount -o discard. 
And yes, overwrites are supposedly more expensive than > >> >>>>trim old data + allocate new data, but maybe if you compare it with the work > >> >>>>XFS has to do, perhaps the tradeoff is bad. > >> >>>> > >> >>>Ok, my understanding is that '-o discard' is not recommended in favor of > >> >>>periodic fstrim for performance reasons, but that may or may not still > >> >>>be the case. > >> >>I understand that most SSDs have queued trim these days, but maybe I'm > >> >>optimistic. > >> >> > >> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 19:35 ` Brian Foster @ 2015-12-01 19:45 ` Avi Kivity 0 siblings, 0 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-01 19:45 UTC (permalink / raw) To: Brian Foster, Glauber Costa; +Cc: xfs On 12/01/2015 09:35 PM, Brian Foster wrote: > On Tue, Dec 01, 2015 at 02:07:41PM -0500, Glauber Costa wrote: >> Hi Brian, >> >> >>> Either way, the extents have to be read in at some point and I'd expect >>> that cpu to schedule onto some other task while that thread waits on I/O >>> to complete (read-ahead could also be a factor here, but I haven't >>> really dug into how that is triggered for buffers). >>> >> >> Being a datastore, we expect to run practically alone in any box we're >> at. That means that there is no other task to run. If io_submit >> blocks, the system blocks. The assumption that blocking will just >> yield the processor for another thread makes sense in the general case >> where you assume more than one application running and/or more than >> one thread within the same application. >> > Hmm, well that helps me understand the concern a bit more. That said, I > still question how likely this condition is. Even if this is a > completely stripped down userspace with no other applications running, > the kernel (or even XFS) alone might have plenty of threads/work items > to execute to take care of "background" tasks for various subsystems. There are not. We grab almost all of memory. All our I/O is O_DIRECT so there is no page cache to write back. There may be softirq work from networking, but in one mode we have (not yet in production), we use a userspace networking stack, so no softirq at all. That said, I doubt this is a problem now. Because the files are large and well laid out, the amount of metadata is small and can easily be cached. We might prime the metadata cache before launching the application, or just ignore the whole problem. 
It would be much worse with small files, but that isn't the case for us. > > Of course, we don't have all of the details of your environment so > perhaps this is not the case. Perhaps a more productive approach here > might be to find a way to detect this particular case (once you've > worked out the other AG count tunings and whatnot that you want to use) > where a thread into the fs is blocked and actually has nothing else to > do and work from there. I _think_ there is such a thing as an idle task > somewhere that might be useful to help quantify this, but I'd have to > dig around to understand it better. We simply observe the idle cpu counter going above zero. Once we resolve the other issues, we'll instrument the kernel with systemtap and see where the other blockages come from. > That actually gives us a concrete scenario to work with, try to > reproduce and improve on. It also facilitates improvements that might be > beneficial to the general use case as opposed to tailored for this > particular use case and highly specific environment. For example, if we > find a particular sustained workload that repetitively blocks with > nothing else to do, document and characterize it for the list and I'm > sure people will come up with a variety of ideas to try and address it. > Otherwise, we're kind of just looking around for context switch points > and assuming that they will all just block with nothing else to do. For > one, I don't think that's really accurate. It's also not very productive > an approach and doesn't have any measurable benefit if it doesn't come > along with a test case or reproducible condition. I agree completely. We'll try to find better probe points than schedule(). We'll also be able to come up with reproducers, this should not be too hard once we have good instrumentation. > Brian > >> From our user's perspective, however, every time that happens we can't >> make progress. It doesn't really matter where it blocks. 
>> >> If io_submit returns without blocking, we can still push more work, >> even though the kernel is still not ready to proceed. If it blocks, >> we're dead. >> >>> Brian >>> >>>>> Brian >>>>> >>>>>>>> Seastar (the async user framework which we use to drive xfs) makes writing >>>>>>>> code like this easy, using continuations; but of course from ordinary >>>>>>>> threaded code it can be quite hard. >>>>>>>> >>>>>>>> btw, there was an attempt to make ext[34] async using this method, but I >>>>>>>> think it was ripped out. Yes, the mortal remains can still be seen with >>>>>>>> 'git grep EIOCBQUEUED'. >>>>>>>> >>>>>>>>>>> It sounds to me that first and foremost you want to make sure you don't >>>>>>>>>>> have however many parallel operations you typically have running >>>>>>>>>>> contending on the same inodes or AGs. Hint: creating files under >>>>>>>>>>> separate subdirectories is a quick and easy way to allocate inodes under >>>>>>>>>>> separate AGs (the agno is encoded into the upper bits of the inode >>>>>>>>>>> number). >>>>>>>>>> Unfortunately our directory layout cannot be changed. And doesn't this >>>>>>>>>> require having agcount == O(number of active files)? That is easily in the >>>>>>>>>> thousands. >>>>>>>>>> >>>>>>>>> I think Glauber's O(nr_cpus) comment is probably the more likely >>>>>>>>> ballpark, but really it's something you'll probably just need to test to >>>>>>>>> see how far you need to go to avoid AG contention. >>>>>>>>> >>>>>>>>> I'm primarily throwing the subdir thing out there for testing purposes. >>>>>>>>> It's just an easy way to create inodes in a bunch of separate AGs so you >>>>>>>>> can determine whether/how much it really helps with modified AG counts. >>>>>>>>> I don't know enough about your application design to really comment on >>>>>>>>> that... >>>>>>>> We have O(cpus) shards that operate independently. 
Each shard writes 32MB >>>>>>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes >>>>>>>> without blocking); the files are then flushed and closed, and later removed. >>>>>>>> In parallel there are sequential writes and reads of large files using 128kB >>>>>>>> buffers), as well as random reads. Files are immutable (append-only), and >>>>>>>> if a file is being written, it is not concurrently read. In general files >>>>>>>> are not shared across shards. All I/O is async and O_DIRECT. open(), >>>>>>>> truncate(), fdatasync(), and friends are called from a helper thread. >>>>>>>> >>>>>>>> As far as I can tell it should a very friendly load for XFS and SSDs. >>>>>>>> >>>>>>>>>>> Reducing the frequency of block allocation/frees might also be >>>>>>>>>>> another help (e.g., preallocate and reuse files, >>>>>>>>>> Isn't that discouraged for SSDs? >>>>>>>>>> >>>>>>>>> Perhaps, if you're referring to the fact that the blocks are never freed >>>>>>>>> and thus never discarded..? Are you running fstrim? >>>>>>>> mount -o discard. And yes, overwrites are supposedly more expensive than >>>>>>>> trim old data + allocate new data, but maybe if you compare it with the work >>>>>>>> XFS has to do, perhaps the tradeoff is bad. >>>>>>>> >>>>>>> Ok, my understanding is that '-o discard' is not recommended in favor of >>>>>>> periodic fstrim for performance reasons, but that may or may not still >>>>>>> be the case. >>>>>> I understand that most SSDs have queued trim these days, but maybe I'm >>>>>> optimistic. >>>>>> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 18:51 ` Brian Foster 2015-12-01 19:07 ` Glauber Costa @ 2015-12-01 19:26 ` Avi Kivity 2015-12-01 19:41 ` Christoph Hellwig 2015-12-02 0:13 ` Brian Foster 1 sibling, 2 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-01 19:26 UTC (permalink / raw) To: Brian Foster; +Cc: Glauber Costa, xfs On 12/01/2015 08:51 PM, Brian Foster wrote: > On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote: >> >> On 12/01/2015 06:29 PM, Brian Foster wrote: >>> On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote: >>>> On 12/01/2015 06:01 PM, Brian Foster wrote: >>>>> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: >>>>>> On 12/01/2015 04:56 PM, Brian Foster wrote: >>>>>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: >>>>>>>> On 12/01/2015 03:11 PM, Brian Foster wrote: >>>>>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: >>>>>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote: >>>>>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: >>>>>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote: >>>>>>>>> ... > ... >>>>>>>> Won't io_submit() also trigger metadata I/O? Or is that all deferred to >>>>>>>> async tasks? I don't mind them blocking each other as long as they let my >>>>>>>> io_submit alone. >>>>>>>> >>>>>>> Yeah, it can trigger metadata reads, force the log (the stale buffer >>>>>>> example) or push the AIL (wait on log space). Metadata changes made >>>>>>> directly via your I/O request are logged/committed via transactions, >>>>>>> which are generally processed asynchronously from that point on. >>>>>>> >>>>>>>>> io_submit() can probably block in a variety of >>>>>>>>> places afaict... it might have to read in the inode extent map, allocate >>>>>>>>> blocks, take inode/ag locks, reserve log space for transactions, etc. >>>>>>>> Any chance of changing all that to be asynchronous? Doesn't sound too hard, >>>>>>>> if somebody else has to do it. 
>>>>>>>> >>>>>>> I'm not following... if the fs needs to read in the inode extent map to >>>>>>> prepare for an allocation, what else can the thread do but wait? Are you >>>>>>> suggesting the request kick off whatever the blocking action happens to >>>>>>> be asynchronously and return with an error such that the request can be >>>>>>> retried later? >>>>>> Not quite, it should be invisible to the caller. >>>>>> >>>>>> That is, the code called by io_submit() (file_operations::write_iter, it >>>>>> seems to be called today) can kick off this operation and have it continue >>>>> >from where it left off. >>>>> Isn't that generally what happens today? >>>> You tell me. According to $subject, apparently not enough. Maybe we're >>>> triggering it more often, or we suffer more when it does trigger (the latter >>>> probably more likely). >>>> >>> The original mail describes looking at the sched:sched_switch tracepoint >>> which on a quick look, appears to fire whenever a cpu context switch >>> occurs. This likely triggers any time we wait on an I/O or a contended >>> lock (among other situations I'm sure), and it signifies that something >>> else is going to execute in our place until this thread can make >>> progress. >> For us, nothing else can execute in our place, we usually have exactly one >> thread per logical core. So we are heavily dependent on io_submit not >> sleeping. >> > Yes, this "coroutine model" makes more sense to me from the application > perspective. I'm just trying to understand what you're after from the > kernel perspective. It's basically the same thing. 
To do this, we'd have get_block either return the block's address (if it was in some metadata cache), or, if it was not, issue an I/O that fills (part of) that cache, and as its completion function, a continuation that reruns __blockdev_direct_IO from the point it was stopped so it can submit the data I/O (if the metadata cache was completely updated) or issue the next I/O aiming to fill that metadata cache, if it was not. Without that (and the more complicated code for the write path) io_submit is basically unusable. Yes, parts of it are asynchronous, but if other parts of it are still synchronous, we end up requiring thread_count > cpu_count and now we have to context switch constantly. > >> The case of a contended lock is, to me, less worrying. It can be reduced by >> using more allocation groups, which is apparently the shared resource under >> contention. >> > Yep. > >> The case of waiting for I/O is much more worrying, because I/O latencies are >> much higher. But it seems like most of the DIO path does not trigger >> locking around I/O (and we are careful to avoid the ones that do, like >> writing beyond eof). >> >> (sorry for repeating myself, I have the feeling we are talking past each >> other and want to be on the same page) >> > Yeah, my point is just that just because the thread blocked on I/O, > doesn't mean the cpu can't carry on with some useful work for another > task. In our case, there is no other task. We run one thread per logical core, so if that thread gets blocked, the cpu idles. The whole point of io_submit() is to issue an I/O and let the caller continue processing immediately. It is the equivalent of O_NONBLOCK for networking code. If O_NONBLOCK did block from time to time, practically all modern network applications would see a huge performance drop.
> >>>>> We submit an I/O which is >>>>> asynchronous in nature and wait on a completion, which causes the cpu to >>>>> schedule and execute another task until the completion is set by I/O >>>>> completion (via an async callback). At that point, the issuing thread >>>>> continues where it left off. I suspect I'm missing something... can you >>>>> elaborate on what you'd do differently here (and how it helps)? >>>> Just apply the same technique everywhere: convert locks to trylock + >>>> schedule a continuation on failure. >>>> >>> I'm certainly not an expert on the kernel scheduling, locking and >>> serialization mechanisms, but my understanding is that most things >>> outside of spin locks are reschedule points. For example, the >>> wait_for_completion() calls XFS uses to wait on I/O boil down to >>> schedule_timeout() calls. Buffer locks are implemented as semaphores and >>> down() can end up in the same place. >> But, for the most part, XFS seems to be able to avoid sleeping. The call to >> __blockdev_direct_IO only launches the I/O, so any locking is only around >> cpu operations and, unless there is contention, won't cause us to sleep in >> io_submit(). >> >> Trying to follow the code, it looks like xfs_get_blocks_direct (and >> __blockdev_direct_IO's get_block parameter in general) is synchronous, so >> we're just lucky to have everything in cache. If it isn't, we block right >> there. I really hope I'm misreading this and some other magic is happening >> elsewhere instead of this. >> > Nope, it's synchronous from a code perspective. The > xfs_bmapi_read()->xfs_iread_extents() path could have to read in the > inode bmap metadata if it hasn't been done already. Note that this > should only happen once as everything is stored in-core, so in most > cases this is skipped. It's also possible extents are read in via some > other path/operation on the inode before an async I/O happens to be > submitted (e.g., see some of the other xfs_bmapi_read() callers). 
Is there (could we add) some ioctl to prime this cache? We could call it from a worker thread where we don't mind blocking during open. What is the eviction policy for this cache? Is it simply the block device's page cache? What about the write path, will we see the same problems there? I would guess the problem is less severe there if the metadata is written with writeback policy. > > Either way, the extents have to be read in at some point and I'd expect > that cpu to schedule onto some other task while that thread waits on I/O > to complete (read-ahead could also be a factor here, but I haven't > really dug into how that is triggered for buffers). To provide an example, our application, which is a database, faces this exact problem at a higher level. Data is stored in data files, and data items' locations are stored in index files. When we read a bit of data, we issue an index read, and pass it a continuation to be executed when the read completes. This latter continuation parses the data and passes it to the code that prepares it for merging with data from other data files, and an eventual return to the user. Having written code for over a year in this style, I've come to expect it to be used everywhere asynchronous I/O is used, but I realize it is fairly hard without good support from a framework that allows continuations to be composed in a natural way.
* Re: sleeps and waits during io_submit 2015-12-01 19:26 ` Avi Kivity @ 2015-12-01 19:41 ` Christoph Hellwig 2015-12-01 19:50 ` Avi Kivity 2015-12-02 0:13 ` Brian Foster 1 sibling, 1 reply; 58+ messages in thread From: Christoph Hellwig @ 2015-12-01 19:41 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote: > It's basically the same thing. To to this, we'd have get_block either > return the block's address (if it was in some metadata cache), or, if it was > not, issue an I/O that fills (part of) that cache, and as its completion > function, a continuation that reruns __blockdev_direct_IO from the point it > was stopped so it can submit the data I/O (if the metadata cache was > completely updated) or issue the next I/O aiming to fill that metadata > cache, if it was not. We did something like this for blocking reads with great results, and it could be done similarly for direct I/O I think: https://lwn.net/Articles/612483/ Unfortunately Andrew shut it down for odd reasons so it didn't get in.
* Re: sleeps and waits during io_submit 2015-12-01 19:41 ` Christoph Hellwig @ 2015-12-01 19:50 ` Avi Kivity 0 siblings, 0 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-01 19:50 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Brian Foster, Glauber Costa, xfs On 12/01/2015 09:41 PM, Christoph Hellwig wrote: > On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote: >> It's basically the same thing. To to this, we'd have get_block either >> return the block's address (if it was in some metadata cache), or, if it was >> not, issue an I/O that fills (part of) that cache, and as its completion >> function, a continuation that reruns __blockdev_direct_IO from the point it >> was stopped so it can submit the data I/O (if the metadata cache was >> completely updated) or issue the next I/O aiming to fill that metadata >> cache, if it was not. > We did something this for blocking reads with great results, and it could be > done similarly for direct I/O I think: > > https://lwn.net/Articles/612483/ > > Unfortunately Andrew shut it down for odd reasons so it didn't get in. How would this work? io_submit() returns -ENOTALLMETADATAISINCACHE, user calls io_submit() again from a worker thread, where he doesn't mind blocking? In fact sys_io_submit() could catch this error and resubmit the I/O on its own using a work item, and io_submit() would become non-blocking, at least on I/O (lock contention may still be a problem, but a smaller one).
* Re: sleeps and waits during io_submit 2015-12-01 19:26 ` Avi Kivity 2015-12-01 19:41 ` Christoph Hellwig @ 2015-12-02 0:13 ` Brian Foster 2015-12-02 0:57 ` Dave Chinner 2015-12-02 8:34 ` Avi Kivity 1 sibling, 2 replies; 58+ messages in thread From: Brian Foster @ 2015-12-02 0:13 UTC (permalink / raw) To: Avi Kivity; +Cc: Glauber Costa, xfs On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote: > On 12/01/2015 08:51 PM, Brian Foster wrote: > >On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote: > >> > >>On 12/01/2015 06:29 PM, Brian Foster wrote: > >>>On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote: > >>>>On 12/01/2015 06:01 PM, Brian Foster wrote: > >>>>>On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: > >>>>>>On 12/01/2015 04:56 PM, Brian Foster wrote: > >>>>>>>On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: > >>>>>>>>On 12/01/2015 03:11 PM, Brian Foster wrote: > >>>>>>>>>On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: > >>>>>>>>>>On 11/30/2015 06:14 PM, Brian Foster wrote: > >>>>>>>>>>>On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: > >>>>>>>>>>>>On 11/30/2015 04:10 PM, Brian Foster wrote: ... > >>The case of waiting for I/O is much more worrying, because I/O latency are > >>much higher. But it seems like most of the DIO path does not trigger > >>locking around I/O (and we are careful to avoid the ones that do, like > >>writing beyond eof). > >> > >>(sorry for repeating myself, I have the feeling we are talking past each > >>other and want to be on the same page) > >> > >Yeah, my point is just that just because the thread blocked on I/O, > >doesn't mean the cpu can't carry on with some useful work for another > >task. > > In our case, there is no other task. We run one thread per logical core, so > if that thread gets blocked, the cpu idles. > > The whole point of io_submit() is to issue an I/O and let the caller > continue processing immediately. 
It is the equivalent of O_NONBLOCK for > networking code. If O_NONBLOCK did block from time to time, practically all > modern network applications would see a huge performance drop. > Ok, but my understanding is that O_NONBLOCK would return an error code in the blocking case such that userspace can do something else or retry from a blockable context. I think this is similar to what hch posted wrt to the pwrite2() bits for nonblocking buffered I/O or what I was asking about earlier on with regard to returning an error if some blocking would otherwise occur. > > > >>>>> We submit an I/O which is > >>>>>asynchronous in nature and wait on a completion, which causes the cpu to > >>>>>schedule and execute another task until the completion is set by I/O > >>>>>completion (via an async callback). At that point, the issuing thread > >>>>>continues where it left off. I suspect I'm missing something... can you > >>>>>elaborate on what you'd do differently here (and how it helps)? > >>>>Just apply the same technique everywhere: convert locks to trylock + > >>>>schedule a continuation on failure. > >>>> > >>>I'm certainly not an expert on the kernel scheduling, locking and > >>>serialization mechanisms, but my understanding is that most things > >>>outside of spin locks are reschedule points. For example, the > >>>wait_for_completion() calls XFS uses to wait on I/O boil down to > >>>schedule_timeout() calls. Buffer locks are implemented as semaphores and > >>>down() can end up in the same place. > >>But, for the most part, XFS seems to be able to avoid sleeping. The call to > >>__blockdev_direct_IO only launches the I/O, so any locking is only around > >>cpu operations and, unless there is contention, won't cause us to sleep in > >>io_submit(). > >> > >>Trying to follow the code, it looks like xfs_get_blocks_direct (and > >>__blockdev_direct_IO's get_block parameter in general) is synchronous, so > >>we're just lucky to have everything in cache. 
If it isn't, we block right > >>there. I really hope I'm misreading this and some other magic is happening > >>elsewhere instead of this. > >> > >Nope, it's synchronous from a code perspective. The > >xfs_bmapi_read()->xfs_iread_extents() path could have to read in the > >inode bmap metadata if it hasn't been done already. Note that this > >should only happen once as everything is stored in-core, so in most > >cases this is skipped. It's also possible extents are read in via some > >other path/operation on the inode before an async I/O happens to be > >submitted (e.g., see some of the other xfs_bmapi_read() callers). > > Is there (could we add) some ioctl to prime this cache? We could call it > from a worker thread where we don't mind blocking during open. > I suppose that's possible, or the worker thread could perform some existing operation known to prime the cache. I don't think it's worth getting into without a concrete example, however. The extent read example we're batting around might not ever be a problem (as you've noted due to file size), if files are truncated and recycled, for example. > What is the eviction policy for this cache? Is it simply the block > device's page cache? > IIUC the extent list stays around until the inode is reclaimed. There's a separate buffer cache for metadata buffers. Both types of objects would be reclaimed based on memory pressure. > What about the write path, will we see the same problems there? I would > guess the problem is less severe there if the metadata is written with > writeback policy. > Metadata is modified in-core and handed off to the logging infrastructure via a transaction. The log is flushed to disk some time later and metadata writeback occurs asynchronously via the xfsaild thread. 
Brian > > > >Either way, the extents have to be read in at some point and I'd expect > >that cpu to schedule onto some other task while that thread waits on I/O > >to complete (read-ahead could also be a factor here, but I haven't > >really dug into how that is triggered for buffers). > > To provide an example, our application, which is a database, faces this > problem exact at a higher level. Data is stored in data files, and data > items' locations are stored in index files. When we read a bit of data, we > issue an index read, and pass it a continuation to be executed when the read > completes. This latter continuation parses the data and passes it to the > code that prepares it for merging with data from other data files, and an > eventual return to the user. > > Having written code for over a year in this style, I've come to expect it to > be used everywhere asynchronous I/O is used, but I realize it is fairly hard > without good support from a framework that allows continuations to be > composed in a natural way. > > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-02 0:13 ` Brian Foster @ 2015-12-02 0:57 ` Dave Chinner 2015-12-02 8:38 ` Avi Kivity 2015-12-02 8:34 ` Avi Kivity 1 sibling, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-12-02 0:57 UTC (permalink / raw) To: Brian Foster; +Cc: Avi Kivity, Glauber Costa, xfs On Tue, Dec 01, 2015 at 07:13:29PM -0500, Brian Foster wrote: > On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote: > > On 12/01/2015 08:51 PM, Brian Foster wrote: > > >On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote: > > >Nope, it's synchronous from a code perspective. The > > >xfs_bmapi_read()->xfs_iread_extents() path could have to read in the > > >inode bmap metadata if it hasn't been done already. Note that this > > >should only happen once as everything is stored in-core, so in most > > >cases this is skipped. It's also possible extents are read in via some > > >other path/operation on the inode before an async I/O happens to be > > >submitted (e.g., see some of the other xfs_bmapi_read() callers). > > > > Is there (could we add) some ioctl to prime this cache? We could call it > > from a worker thread where we don't mind blocking during open. > > > > I suppose that's possible, or the worker thread could perform some > existing operation known to prime the cache. I don't think it's worth > getting into without a concrete example, however. You mean like EXT4_IOC_PRECACHE_EXTENTS? You know, that ioctl that the ext4 googlers needed to add because they already had AIO applications that depend on it and they hadn't realised that they could do exactly the same thing with a FIEMAP call? i.e. this call to count the number of extents in the file: struct fiemap fm = { .fm_start = 0, .fm_length = FIEMAP_MAX_OFFSET, }; res = ioctl(fd, FS_IOC_FIEMAP, &fm); will cause XFS to read in the extent map and cache it. Cheers, Dave.
-- Dave Chinner david@fromorbit.com
* Re: sleeps and waits during io_submit 2015-12-02 0:57 ` Dave Chinner @ 2015-12-02 8:38 ` Avi Kivity 0 siblings, 0 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-02 8:38 UTC (permalink / raw) To: Dave Chinner, Brian Foster; +Cc: Glauber Costa, xfs On 12/02/2015 02:57 AM, Dave Chinner wrote: > On Tue, Dec 01, 2015 at 07:13:29PM -0500, Brian Foster wrote: >> On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote: >>> On 12/01/2015 08:51 PM, Brian Foster wrote: >>>> On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote: >>>> Nope, it's synchronous from a code perspective. The >>>> xfs_bmapi_read()->xfs_iread_extents() path could have to read in the >>>> inode bmap metadata if it hasn't been done already. Note that this >>>> should only happen once as everything is stored in-core, so in most >>>> cases this is skipped. It's also possible extents are read in via some >>>> other path/operation on the inode before an async I/O happens to be >>>> submitted (e.g., see some of the other xfs_bmapi_read() callers). >>> Is there (could we add) some ioctl to prime this cache? We could call it >>> from a worker thread where we don't mind blocking during open. >>> >> I suppose that's possible, or the worker thread could perform some >> existing operation known to prime the cache. I don't think it's worth >> getting into without a concrete example, however. > You mean like EXT4_IOC_PRECACHE_EXTENTS? > > You know, that ioctl that the ext4 googlers needed to add because > they already had AIO applications that depend on it and they hadn't > realised that the could do exactly the same thing with a FIEMAP > call? i.e. this call to count the number of extents in the file: > > struct fiemap fm = { > .offset = 0, > .length = FIEMAP_MAX_OFFSET, > }; > > res = ioctl(fd, FS_IOC_FIEMAP, &fm); > > will cause XFS to read in the extent map and cache it. > Cool, it even appears to be callable with CAP_WHATEVER. 
So we would use this to prime the metadata caches before startup, if they turn out to be a problem in practice.
* Re: sleeps and waits during io_submit 2015-12-02 0:13 ` Brian Foster 2015-12-02 0:57 ` Dave Chinner @ 2015-12-02 8:34 ` Avi Kivity 2015-12-08 6:03 ` Dave Chinner 1 sibling, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-02 8:34 UTC (permalink / raw) To: Brian Foster; +Cc: Glauber Costa, xfs On 12/02/2015 02:13 AM, Brian Foster wrote: > On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote: >> On 12/01/2015 08:51 PM, Brian Foster wrote: >>> On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote: >>>> On 12/01/2015 06:29 PM, Brian Foster wrote: >>>>> On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote: >>>>>> On 12/01/2015 06:01 PM, Brian Foster wrote: >>>>>>> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: >>>>>>>> On 12/01/2015 04:56 PM, Brian Foster wrote: >>>>>>>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: >>>>>>>>>> On 12/01/2015 03:11 PM, Brian Foster wrote: >>>>>>>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote: >>>>>>>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote: >>>>>>>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote: >>>>>>>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote: > ... >>>> The case of waiting for I/O is much more worrying, because I/O latency are >>>> much higher. But it seems like most of the DIO path does not trigger >>>> locking around I/O (and we are careful to avoid the ones that do, like >>>> writing beyond eof). >>>> >>>> (sorry for repeating myself, I have the feeling we are talking past each >>>> other and want to be on the same page) >>>> >>> Yeah, my point is just that just because the thread blocked on I/O, >>> doesn't mean the cpu can't carry on with some useful work for another >>> task. >> In our case, there is no other task. We run one thread per logical core, so >> if that thread gets blocked, the cpu idles. >> >> The whole point of io_submit() is to issue an I/O and let the caller >> continue processing immediately. 
It is the equivalent of O_NONBLOCK for >> networking code. If O_NONBLOCK did block from time to time, practically all >> modern network applications would see a huge performance drop. >> > Ok, but my understanding is that O_NONBLOCK would return an error code > in the blocking case such that userspace can do something else or retry > from a blockable context. I did not mean the exact equivalent, but in the spirit of allowing a thread to perform an I/O task (networking or file I/O) in parallel with computation. For networking, returning an error is fine because there exists a notification (epoll) to tell userspace when a retry would succeed. For file I/O, there isn't one. Still, returning an error is better than nothing because then, as you say, you can retry in a blockable context. > I think this is similar to what hch posted wrt > to the pwrite2() bits for nonblocking buffered I/O or what I was asking > about earlier on with regard to returning an error if some blocking > would otherwise occur. Yes. Anything except silently blocking! > >>>>>>> We submit an I/O which is >>>>>>> asynchronous in nature and wait on a completion, which causes the cpu to >>>>>>> schedule and execute another task until the completion is set by I/O >>>>>>> completion (via an async callback). At that point, the issuing thread >>>>>>> continues where it left off. I suspect I'm missing something... can you >>>>>>> elaborate on what you'd do differently here (and how it helps)? >>>>>> Just apply the same technique everywhere: convert locks to trylock + >>>>>> schedule a continuation on failure. >>>>>> >>>>> I'm certainly not an expert on the kernel scheduling, locking and >>>>> serialization mechanisms, but my understanding is that most things >>>>> outside of spin locks are reschedule points. For example, the >>>>> wait_for_completion() calls XFS uses to wait on I/O boil down to >>>>> schedule_timeout() calls. 
Buffer locks are implemented as semaphores and >>>>> down() can end up in the same place. >>>> But, for the most part, XFS seems to be able to avoid sleeping. The call to >>>> __blockdev_direct_IO only launches the I/O, so any locking is only around >>>> cpu operations and, unless there is contention, won't cause us to sleep in >>>> io_submit(). >>>> >>>> Trying to follow the code, it looks like xfs_get_blocks_direct (and >>>> __blockdev_direct_IO's get_block parameter in general) is synchronous, so >>>> we're just lucky to have everything in cache. If it isn't, we block right >>>> there. I really hope I'm misreading this and some other magic is happening >>>> elsewhere instead of this. >>>> >>> Nope, it's synchronous from a code perspective. The >>> xfs_bmapi_read()->xfs_iread_extents() path could have to read in the >>> inode bmap metadata if it hasn't been done already. Note that this >>> should only happen once as everything is stored in-core, so in most >>> cases this is skipped. It's also possible extents are read in via some >>> other path/operation on the inode before an async I/O happens to be >>> submitted (e.g., see some of the other xfs_bmapi_read() callers). >> Is there (could we add) some ioctl to prime this cache? We could call it >> from a worker thread where we don't mind blocking during open. >> > I suppose that's possible, or the worker thread could perform some > existing operation known to prime the cache. I don't think it's worth > getting into without a concrete example, however. The extent read > example we're batting around might not ever be a problem (as you've > noted due to file size), if files are truncated and recycled, for > example. > >> What is the eviction policy for this cache? Is it simply the block >> device's page cache? >> > IIUC the extent list stays around until the inode is reclaimed. There's > a separate buffer cache for metadata buffers. Both types of objects > would be reclaimed based on memory pressure. 
It comes down to size of disk, size of memory, and average file size. I expect that with current disk and memory sizes the metadata is quite small, so this might not be a problem, and even a cold start would self-prime in a reasonably short time. > >> What about the write path, will we see the same problems there? I would >> guess the problem is less severe there if the metadata is written with >> writeback policy. >> > Metadata is modified in-core and handed off to the logging > infrastructure via a transaction. The log is flushed to disk some time > later and metadata writeback occurs asynchronously via the xfsaild > thread. Unless, I expect, if the log is full. Since we're hammering on the disk quite heavily, the log would be fighting with user I/O and possibly losing. Does XFS throttle user I/O in order to get the log buffers recycled faster? Is there any way for us to keep track of it, and reduce disk pressure when it gets full? Oh you answered that already, /sys/fs/xfs/device/log/*. > > Brian > >>> Either way, the extents have to be read in at some point and I'd expect >>> that cpu to schedule onto some other task while that thread waits on I/O >>> to complete (read-ahead could also be a factor here, but I haven't >>> really dug into how that is triggered for buffers). >> To provide an example, our application, which is a database, faces this >> problem exact at a higher level. Data is stored in data files, and data >> items' locations are stored in index files. When we read a bit of data, we >> issue an index read, and pass it a continuation to be executed when the read >> completes. This latter continuation parses the data and passes it to the >> code that prepares it for merging with data from other data files, and an >> eventual return to the user. 
>> >> Having written code for over a year in this style, I've come to expect it to >> be used everywhere asynchronous I/O is used, but I realize it is fairly hard >> without good support from a framework that allows continuations to be >> composed in a natural way.
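The /sys/fs/xfs/<device>/log/* files mentioned above can be polled from a worker thread to watch how much of the log is pinned. A minimal sketch, with the caveat that the attribute names (log_head_lsn, log_tail_lsn) and their "cycle:block" format are assumptions about that sysfs interface — verify them against your kernel before relying on this:

```python
import os

def read_xfs_log_state(sysfs_log_dir):
    """Read the XFS log head/tail LSNs from a /sys/fs/xfs/<dev>/log
    directory. Attribute names and the "cycle:block" format are
    assumptions; check your kernel's documentation."""
    state = {}
    for name in ("log_head_lsn", "log_tail_lsn"):
        with open(os.path.join(sysfs_log_dir, name)) as f:
            cycle, block = f.read().strip().split(":")
            state[name] = (int(cycle), int(block))
    return state

def log_distance(state, log_blocks):
    """Rough count of log blocks between tail and head, i.e. how much
    of the log is currently held up by unflushed metadata."""
    head_cycle, head_block = state["log_head_lsn"]
    tail_cycle, tail_block = state["log_tail_lsn"]
    return (head_cycle - tail_cycle) * log_blocks + (head_block - tail_block)
```

Polling this once a second and backing off user writes as the distance approaches the log size would be one way to act on it, though as discussed later in the thread, second-guessing the filesystem this way has its own hazards.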
* Re: sleeps and waits during io_submit 2015-12-02 8:34 ` Avi Kivity @ 2015-12-08 6:03 ` Dave Chinner 2015-12-08 13:56 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-12-08 6:03 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs On Wed, Dec 02, 2015 at 10:34:14AM +0200, Avi Kivity wrote: > On 12/02/2015 02:13 AM, Brian Foster wrote: > >Metadata is modified in-core and handed off to the logging > >infrastructure via a transaction. The log is flushed to disk some time > >later and metadata writeback occurs asynchronously via the xfsaild > >thread. > > Unless, I expect, if the log is full. Since we're hammering on the > disk quite heavily, the log would be fighting with user I/O and > possibly losing. > > Does XFS throttle user I/O in order to get the log buffers recycled faster? No. XFS tags the metadata IO with REQ_META so that the IO schedulers can tell the difference between metadata and data IO, and schedule them appropriately. Further, log buffers are also tagged with REQ_SYNC to indicate they are latency sensitive IOs, which the IO schedulers again treat differently to minimise latency in the face of bulk async IO which is not latency sensitive. IOWs, IO prioritisation and dispatch scheduling is the job of the IO scheduler, not the filesystem. The filesystem just tells the scheduler how to treat the different types of IO... > Is there any way for us to keep track of it, and reduce disk > pressure when it gets full? Only if you want to make more problems for yourself - second guessing what the filesystem is going to do will only lead you to dancing the Charlie Foxtrot on a regular basis. :/ Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: sleeps and waits during io_submit 2015-12-08 6:03 ` Dave Chinner @ 2015-12-08 13:56 ` Avi Kivity 2015-12-08 23:32 ` Dave Chinner 0 siblings, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-08 13:56 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, Glauber Costa, xfs On 12/08/2015 08:03 AM, Dave Chinner wrote: > On Wed, Dec 02, 2015 at 10:34:14AM +0200, Avi Kivity wrote: >> On 12/02/2015 02:13 AM, Brian Foster wrote: >>> Metadata is modified in-core and handed off to the logging >>> infrastructure via a transaction. The log is flushed to disk some time >>> later and metadata writeback occurs asynchronously via the xfsaild >>> thread. >> Unless, I expect, if the log is full. Since we're hammering on the >> disk quite heavily, the log would be fighting with user I/O and >> possibly losing. >> >> Does XFS throttle user I/O in order to get the log buffers recycled faster? > No. XFS tags the metadata IO with REQ_META that the IO schedulers > can tell the difference between metadata and data IO, and schedule > them appropriately. Further. log buffers are also tagged with > REQ_SYNC to indicate they are latency sensitive IOs, whcih the IO > schedulers again treat differently to minimise latency in the face > of bulk async IO which is not latency sensitive. > > IOWs, IO prioritisation and dispatch scheduling is the job of the IO > scheduler, not the filesystem. The filesystem just tells the > scheduler how to treat the different types of IO... > >> Is there any way for us to keep track of it, and reduce disk >> pressure when it gets full? > Only if you want to make more problems for yourself - second > guessing what the filesystem is going to do will only lead you to > dancing the Charlie Foxtrot on a regular basis. :/ So far the best approach I found that doesn't conflict with this is to limit io_submit iodepth to the natural disk iodepth (or a small multiple thereof). This seems to keep XFS in its comfort zone, and is good for latency anyway. 
The only issue is that the only way to obtain this parameter is to measure it. I wrote a small tool to do this [1], but it's a hassle for users. [1] https://github.com/avikivity/diskplorer
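A diskplorer-style measurement boils down to finding the knee of the iodepth/throughput curve: below the natural iodepth, throughput scales with concurrency; above it, only latency grows. A hypothetical sketch of just the knee detection, taking already-measured (iodepth, throughput) samples as input (the tolerance value is an assumption to tune, not something the tool prescribes):

```python
def find_saturation_depth(samples, tolerance=0.05):
    """Given [(iodepth, throughput)] pairs, return the smallest depth
    beyond which throughput stops improving by more than `tolerance`
    (fractionally). That depth is the knee of the curve, i.e. the
    natural iodepth of the device."""
    samples = sorted(samples)
    for (depth, tput), (_, next_tput) in zip(samples, samples[1:]):
        if next_tput < tput * (1 + tolerance):
            return depth  # next step gained < tolerance: saturated
    return samples[-1][0]  # never saturated within the measured range
```

In practice the samples are noisy (as noted later in the thread), so averaging several runs per depth before feeding them in would be prudent.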
* Re: sleeps and waits during io_submit 2015-12-08 13:56 ` Avi Kivity @ 2015-12-08 23:32 ` Dave Chinner 2015-12-09 8:37 ` Avi Kivity 0 siblings, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-12-08 23:32 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs On Tue, Dec 08, 2015 at 03:56:52PM +0200, Avi Kivity wrote: > On 12/08/2015 08:03 AM, Dave Chinner wrote: > >On Wed, Dec 02, 2015 at 10:34:14AM +0200, Avi Kivity wrote: > >>On 12/02/2015 02:13 AM, Brian Foster wrote: > >>>Metadata is modified in-core and handed off to the logging > >>>infrastructure via a transaction. The log is flushed to disk some time > >>>later and metadata writeback occurs asynchronously via the xfsaild > >>>thread. > >>Unless, I expect, if the log is full. Since we're hammering on the > >>disk quite heavily, the log would be fighting with user I/O and > >>possibly losing. > >> > >>Does XFS throttle user I/O in order to get the log buffers recycled faster? > >No. XFS tags the metadata IO with REQ_META that the IO schedulers > >can tell the difference between metadata and data IO, and schedule > >them appropriately. Further. log buffers are also tagged with > >REQ_SYNC to indicate they are latency sensitive IOs, whcih the IO > >schedulers again treat differently to minimise latency in the face > >of bulk async IO which is not latency sensitive. > > > >IOWs, IO prioritisation and dispatch scheduling is the job of the IO > >scheduler, not the filesystem. The filesystem just tells the > >scheduler how to treat the different types of IO... > > > >>Is there any way for us to keep track of it, and reduce disk > >>pressure when it gets full? > >Only if you want to make more problems for yourself - second > >guessing what the filesystem is going to do will only lead you to > >dancing the Charlie Foxtrot on a regular basis. 
:/ > > So far the best approach I found that doesn't conflict with this is > to limit io_submit iodepth to the natural disk iodepth (or a small > multiple thereof). This seems to keep XFS in its comfort zone, and > is good for latency anyway. That's pretty much what I just explained in my previous reply. ;) > >> The only issue is that the only way to obtain this parameter is to > >> measure it. Yup, exactly what I've been saying ;) However, you can get a pretty good guess on max concurrency from the device characteristics in sysfs: /sys/block/<dev>/queue/nr_requests gives you the maximum IO scheduler request queue depth, and /sys/block/<dev>/device/queue_depth gives you the hardware command queue depth. E.g. a random iscsi device I have attached to a test VM: $ cat /sys/block/sdc/device/queue_depth 32 $ cat /sys/block/sdc/queue/nr_requests 127 Which means 32 physical IOs can be in flight concurrently, and the IO scheduler will queue up to roughly another 100 discrete IOs before it starts blocking incoming IO requests (127 is the typical io scheduler queue depth default). That means maximum non-blocking concurrency is going to be around 100-130 IOs in flight at once. > I wrote a small tool to do this [1], but it's a hassle for users. > > [1] https://github.com/avikivity/diskplorer I note that the NVMe device you tested in the description hits maximum performance with concurrency at around 110-120 read IOs in flight. :) Cheers, Dave. -- Dave Chinner david@fromorbit.com
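Reading those two sysfs limits programmatically is straightforward; a minimal sketch that tolerates devices which don't expose queue_depth at all (as Avi reports below for his nvme device) by returning None for the missing limit:

```python
import os

def queue_limits(dev, sysfs_root="/sys/block"):
    """Best-effort read of the IO scheduler queue depth
    (queue/nr_requests) and the hardware command queue depth
    (device/queue_depth) for a block device. Either file may be
    absent depending on the driver; None is returned in that case."""
    def read_int(path):
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return None

    nr_requests = read_int(os.path.join(sysfs_root, dev, "queue", "nr_requests"))
    queue_depth = read_int(os.path.join(sysfs_root, dev, "device", "queue_depth"))
    return nr_requests, queue_depth
```

For the iscsi example above this would return (127, 32), giving the roughly 100-130 in-flight IOs of non-blocking concurrency Dave derives.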
* Re: sleeps and waits during io_submit 2015-12-08 23:32 ` Dave Chinner @ 2015-12-09 8:37 ` Avi Kivity 0 siblings, 0 replies; 58+ messages in thread From: Avi Kivity @ 2015-12-09 8:37 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, Glauber Costa, xfs On 12/09/2015 01:32 AM, Dave Chinner wrote: > On Tue, Dec 08, 2015 at 03:56:52PM +0200, Avi Kivity wrote: >> On 12/08/2015 08:03 AM, Dave Chinner wrote: >>> On Wed, Dec 02, 2015 at 10:34:14AM +0200, Avi Kivity wrote: >>>> On 12/02/2015 02:13 AM, Brian Foster wrote: >>>>> Metadata is modified in-core and handed off to the logging >>>>> infrastructure via a transaction. The log is flushed to disk some time >>>>> later and metadata writeback occurs asynchronously via the xfsaild >>>>> thread. >>>> Unless, I expect, if the log is full. Since we're hammering on the >>>> disk quite heavily, the log would be fighting with user I/O and >>>> possibly losing. >>>> >>>> Does XFS throttle user I/O in order to get the log buffers recycled faster? >>> No. XFS tags the metadata IO with REQ_META that the IO schedulers >>> can tell the difference between metadata and data IO, and schedule >>> them appropriately. Further. log buffers are also tagged with >>> REQ_SYNC to indicate they are latency sensitive IOs, whcih the IO >>> schedulers again treat differently to minimise latency in the face >>> of bulk async IO which is not latency sensitive. >>> >>> IOWs, IO prioritisation and dispatch scheduling is the job of the IO >>> scheduler, not the filesystem. The filesystem just tells the >>> scheduler how to treat the different types of IO... >>> >>>> Is there any way for us to keep track of it, and reduce disk >>>> pressure when it gets full? >>> Only if you want to make more problems for yourself - second >>> guessing what the filesystem is going to do will only lead you to >>> dancing the Charlie Foxtrot on a regular basis. 
:/ >> So far the best approach I found that doesn't conflict with this is >> to limit io_submit iodepth to the natural disk iodepth (or a small >> multiple thereof). This seems to keep XFS in its comfort zone, and >> is good for latency anyway. > That's pretty much what I just explained in my previous reply. ;) > >> The only issue is that the only way to obtain this parameter is to >> measure it. > Yup, exactly what I've been saying ;) > > However, You can get a pretty good guess on max concurrency from the > device characteristics in sysfs: > > /sys/block/<dev>/queue/nr_requests That's just a fixed number. AFAICT, it isn't derived from the actual device. "measure it" is better than nothing, but when you want to distribute software that works out of the box and does not need extensive tuning, it leaves something to be desired. I'm thinking about detecting the limit dynamically (below the limit, throughput is roughly proportional to concurrency; above the limit, throughput is fixed while latency is proportional to concurrency). The problem is that the measurement is very noisy, the more so because we are two layers above the hardware, and driving it from cores that try very hard not to communicate. The right place to do this is the block layer. > gives you the maximum IO scheduler request queue depth, and > > /sys/block/<dev>/device/queue_depth > > gives you the hardware command queue depth. That's more useful, but it really describes the bus/link/protocol rather than the device itself. I don't have this queue_depth attribute for my nvme0n1 device (4.1.7). > > E.g. a random iscsi device I have attached to a test VM: > > $ cat /sys/block/sdc/device/queue_depth > 32 > $ cat /sys/block/sdc/queue/nr_requests > 127 > > Which means 32 physical IOs can be in flight concurrently, and the > IO scheduler will queue up to roughly another 100 discrete IOs > before it starts blocking incoming IO requests (127 is the typical > io scheduler queue depth default). 
That means maximum non-blocking > concurrency is going to be around 100-130 IOs in flight at once. > >> I wrote a small tool to do this [1], but it's a hassle for users. >> >> [1] https://github.com/avikivity/diskplorer > I note that the NVMe device you tested in the description hits > maximum performance with concurrency at around 110-120 read IOs in > flight. :) > > We increased nr_requests for the test so it wouldn't block. So it's the actual device characteristics, not an artifact of the software stack. If you consider a RAID of these, you can easily need a few hundred concurrent ops. IIRC nvme's maximum iodepth is 64k.
* Re: sleeps and waits during io_submit 2015-12-01 15:22 ` Avi Kivity 2015-12-01 16:01 ` Brian Foster @ 2015-12-01 21:04 ` Dave Chinner 2015-12-01 21:10 ` Glauber Costa 2015-12-01 21:24 ` Avi Kivity 1 sibling, 2 replies; 58+ messages in thread From: Dave Chinner @ 2015-12-01 21:04 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, Glauber Costa, xfs On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: > On 12/01/2015 04:56 PM, Brian Foster wrote: > >On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: > >>> io_submit() can probably block in a variety of > >>>places afaict... it might have to read in the inode extent map, allocate > >>>blocks, take inode/ag locks, reserve log space for transactions, etc. > >>Any chance of changing all that to be asynchronous? Doesn't sound too hard, > >>if somebody else has to do it. > >> > >I'm not following... if the fs needs to read in the inode extent map to > >prepare for an allocation, what else can the thread do but wait? Are you > >suggesting the request kick off whatever the blocking action happens to > >be asynchronously and return with an error such that the request can be > >retried later? > > Not quite, it should be invisible to the caller. I have a pony I can sell you. > That is, the code called by io_submit() > (file_operations::write_iter, it seems to be called today) can kick > off this operation and have it continue from where it left off. This is a problem that people have tried to solve in the past (e.g. syslets, etc) where the thread executes until it has to block, and then it's handed off to a worker thread/syslet to block and the main process returns with EIOCBQUEUED. Basically, you're asking for a real AIO infrastructure to be introduced into the kernel, and I think that's beyond what us XFS guys can do... > >>> Reducing the frequency of block allocation/frees might also be > >>>another help (e.g., preallocate and reuse files, > >>Isn't that discouraged for SSDs?
> >> > >Perhaps, if you're referring to the fact that the blocks are never freed > >and thus never discarded..? Are you running fstrim? > > mount -o discard. And yes, overwrites are supposedly more expensive > than trim old data + allocate new data, but maybe if you compare it > with the work XFS has to do, perhaps the tradeoff is bad. Oh, you do realise that using "-o discard" causes significant delays in journal commit processing? i.e. the journal commit completion blocks until all the discards have been submitted and waited on *synchronously*. This is a problem with the linux block layer in that blkdev_issue_discard() is a synchronous operation..... Hence if you are seeing delays in transactions (e.g. timestamp updates) it's entirely possible that things will get much better if you remove the discard mount option. It's much better from a performance perspective to use the fstrim command every so often - fstrim issues discard operations in the context of the fstrim process - it does not interact with the transaction subsystem at all. Cheers, Dave. -- Dave Chinner david@fromorbit.com
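The run-until-block-then-hand-off pattern described above (syslets, EIOCBQUEUED) can be illustrated in userspace terms. This is only an analogy, not the kernel mechanism: WouldBlock, fast_path and slow_path are hypothetical stand-ins for "the operation completed from cache" versus "the operation would have to sleep for metadata I/O".

```python
from concurrent.futures import Future, ThreadPoolExecutor

class WouldBlock(Exception):
    """Raised by an operation's fast path when it would have to sleep."""

def submit(pool, fast_path, slow_path):
    """Run the non-blocking fast path inline; if it reports it would
    block, hand the blocking slow path off to a worker thread and
    return immediately. The caller polls/awaits the Future, much like
    reaping an iocb that came back EIOCBQUEUED."""
    fut = Future()
    try:
        fut.set_result(fast_path())  # common case: no context switch
    except WouldBlock:
        def run():
            try:
                fut.set_result(slow_path())
            except Exception as exc:
                fut.set_exception(exc)
        pool.submit(run)  # rare case: block on a worker, not the submitter
    return fut

pool = ThreadPoolExecutor(max_workers=4)
# Cached case: completes inline.
hit = submit(pool, lambda: "cached", lambda: "never used")
# Uncached case: fast path refuses, worker does the blocking work.
def fast():
    raise WouldBlock
miss = submit(pool, fast, lambda: "read from disk")
```

The submitting thread never sleeps in either case, which is exactly the property the io_submit() discussion is after; the cost is that the slow path re-executes the operation from scratch on the worker.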
* Re: sleeps and waits during io_submit 2015-12-01 21:04 ` Dave Chinner @ 2015-12-01 21:10 ` Glauber Costa 2015-12-01 21:39 ` Dave Chinner 2015-12-01 21:24 ` Avi Kivity 1 sibling, 1 reply; 58+ messages in thread From: Glauber Costa @ 2015-12-01 21:10 UTC (permalink / raw) To: Dave Chinner; +Cc: Avi Kivity, Brian Foster, xfs On Tue, Dec 1, 2015 at 4:04 PM, Dave Chinner <david@fromorbit.com> wrote: > On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: >> On 12/01/2015 04:56 PM, Brian Foster wrote: >> >On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: >> >>> io_submit() can probably block in a variety of >> >>>places afaict... it might have to read in the inode extent map, allocate >> >>>blocks, take inode/ag locks, reserve log space for transactions, etc. >> >>Any chance of changing all that to be asynchronous? Doesn't sound too hard, >> >>if somebody else has to do it. >> >> >> >I'm not following... if the fs needs to read in the inode extent map to >> >prepare for an allocation, what else can the thread do but wait? Are you >> >suggesting the request kick off whatever the blocking action happens to >> >be asynchronously and return with an error such that the request can be >> >retried later? >> >> Not quite, it should be invisible to the caller. > > I have a pony I can sell you. > >> That is, the code called by io_submit() >> (file_operations::write_iter, it seems to be called today) can kick >> off this operation and have it continue from where it left off. > > This is a problem that people have tried to solve in the past (e.g. > syslets, etc) where the thread executes until it has to block, and > then it's handled off to a worker thread/syslet to block and the > main process returns with EIOCBQUEUED. > > Basically, you're asking for a real AIO infrastructure to > beintroduced into the kernel, and I think that's beyond what us XFS > guys can do... 
> >> >>> Reducing the frequency of block allocation/frees might also be >> >>>another help (e.g., preallocate and reuse files, >> >>Isn't that discouraged for SSDs? >> >> >> >Perhaps, if you're referring to the fact that the blocks are never freed >> >and thus never discarded..? Are you running fstrim? >> >> mount -o discard. And yes, overwrites are supposedly more expensive >> than trim old data + allocate new data, but maybe if you compare it >> with the work XFS has to do, perhaps the tradeoff is bad. > > Oh, you do realise that using "-o discard" causes significant delays > in journal commit processing? i.e. the journal commit completion > blocks until all the discards have been submitted and waited on > *synchronously*. This is a problem with the linux block layer in > that blkdev_issue_discard() is a synchronous operation..... > > Hence if you are seeing delays in transactions (e.g. timestamp updates) > it's entirely possible that things will get much better if you > remove the discard mount option. It's much better from a performance > perspective to use the fstrim command every so often - fstrim issues > discard operations in the context of the fstrim process - it does > not interact with the transaction subsystem at all. Hi Dave, This is news to me. However, in the disk that we have used during the acquisition of this trace, discard doesn't seem to be supported: $ sudo fstrim /data/ fstrim: /data/: the discard operation is not supported In that case, if I understand correctly the discard mount option should be a noop, no? That recommendation is great for our general case, though. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
* Re: sleeps and waits during io_submit 2015-12-01 21:10 ` Glauber Costa @ 2015-12-01 21:39 ` Dave Chinner 0 siblings, 0 replies; 58+ messages in thread From: Dave Chinner @ 2015-12-01 21:39 UTC (permalink / raw) To: Glauber Costa; +Cc: Avi Kivity, Brian Foster, xfs On Tue, Dec 01, 2015 at 04:10:45PM -0500, Glauber Costa wrote: > On Tue, Dec 1, 2015 at 4:04 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: > >> On 12/01/2015 04:56 PM, Brian Foster wrote: > >> mount -o discard. And yes, overwrites are supposedly more expensive > >> than trim old data + allocate new data, but maybe if you compare it > >> with the work XFS has to do, perhaps the tradeoff is bad. > > > > Oh, you do realise that using "-o discard" causes significant delays > > in journal commit processing? i.e. the journal commit completion > > blocks until all the discards have been submitted and waited on > > *synchronously*. This is a problem with the linux block layer in > > that blkdev_issue_discard() is a synchronous operation..... > > > > Hence if you are seeing delays in transactions (e.g. timestamp updates) > > it's entirely possible that things will get much better if you > > remove the discard mount option. It's much better from a performance > > perspective to use the fstrim command every so often - fstrim issues > > discard operations in the context of the fstrim process - it does > > not interact with the transaction subsystem at all. > > Hi Dave, > > This is news to me. > > However, in the disk that we have used during the acquisition of this > trace, discard doesn't seem to be supported: > $ sudo fstrim /data/ > fstrim: /data/: the discard operation is not supported > > In that case, if I understand correctly the discard mount option > should be a noop, no? XFS still makes the blkdev_issue_discard() calls, though, because the block device can turn discard support on and off dynamically. e.g. 
raid devices where a faulty drive is replaced temporarily with a drive that doesn't have discard support. The block device suddenly starts returning -EOPNOTSUPP to the filesystem from blkdev_issue_discard() calls. However, the admin then replaces that drive with a new one that does have discard support, and now blkdev_issue_discard() works as expected. IOWs, if you set the mount option, XFS will always attempt to issue discards... > That recommendation is great for our general case, though. For the moment. Given lots of time, reworking this code could greatly reduce the impact/overhead of it and so make it practical to enable. There's a lot of work to get to that point, though... Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: sleeps and waits during io_submit 2015-12-01 21:04 ` Dave Chinner 2015-12-01 21:10 ` Glauber Costa @ 2015-12-01 21:24 ` Avi Kivity 2015-12-01 21:31 ` Glauber Costa 1 sibling, 1 reply; 58+ messages in thread From: Avi Kivity @ 2015-12-01 21:24 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, Glauber Costa, xfs On 12/01/2015 11:04 PM, Dave Chinner wrote: > On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote: >> On 12/01/2015 04:56 PM, Brian Foster wrote: >>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote: >>>>> io_submit() can probably block in a variety of >>>>> places afaict... it might have to read in the inode extent map, allocate >>>>> blocks, take inode/ag locks, reserve log space for transactions, etc. >>>> Any chance of changing all that to be asynchronous? Doesn't sound too hard, >>>> if somebody else has to do it. >>>> >>> I'm not following... if the fs needs to read in the inode extent map to >>> prepare for an allocation, what else can the thread do but wait? Are you >>> suggesting the request kick off whatever the blocking action happens to >>> be asynchronously and return with an error such that the request can be >>> retried later? >> Not quite, it should be invisible to the caller. > I have a pony I can sell you. You already sold me a pony. >> That is, the code called by io_submit() >> (file_operations::write_iter, it seems to be called today) can kick >> off this operation and have it continue from where it left off. > This is a problem that people have tried to solve in the past (e.g. > syslets, etc) where the thread executes until it has to block, and > then it's handled off to a worker thread/syslet to block and the > main process returns with EIOCBQUEUED. Yes, I remember that. > Basically, you're asking for a real AIO infrastructure to > beintroduced into the kernel, and I think that's beyond what us XFS > guys can do... Sure you can, Dave. In fact you feel an irresistible urge to do it. 
But I don't think the EIOCBQUEUED thing need be repeated. We can have a simpler implementation: - Add a task flag TIF_AIO, which causes any new I/O to fail with EAIOWOULDBLOCK. - have __blockdev_direct_IO() do its block-mapping operations with TIF_AIO set (but remove it just before issuing the bio). - sys_aio_submit() catches EAIOWOULDBLOCK and resubmits the aio in a work item, this time without TIF_AIO games. The effect would be similar to EIOCBQUEUED, but simpler, as instead of issuing any metadata I/O you abort the operation and restart it from scratch. > >>>>> Reducing the frequency of block allocation/frees might also be >>>>> another help (e.g., preallocate and reuse files, >>>> Isn't that discouraged for SSDs? >>>> >>> Perhaps, if you're referring to the fact that the blocks are never freed >>> and thus never discarded..? Are you running fstrim? >> mount -o discard. And yes, overwrites are supposedly more expensive >> than trim old data + allocate new data, but maybe if you compare it >> with the work XFS has to do, perhaps the tradeoff is bad. > Oh, you do realise that using "-o discard" causes significant delays > in journal commit processing? i.e. the journal commit completion > blocks until all the discards have been submitted and waited on > *synchronously*. This is a problem with the linux block layer in > that blkdev_issue_discard() is a synchronous operation..... I do now. What's the unicode for a crying face? > Hence if you are seeing delays in transactions (e.g. timestamp updates) > it's entirely possible that things will get much better if you > remove the discard mount option. It's much better from a performance > perspective to use the fstrim command every so often - fstrim issues > discard operations in the context of the fstrim process - it does > not interact with the transaction subsystem at all. > > All right. On the other hand we have to know when to issue it. That would be when nn% of the disk area have been rewritten. 
Is there some counter I can poll every minute or so for this? Not doing the fstrim in time would cause the disk performance to tank. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: sleeps and waits during io_submit 2015-12-01 21:24 ` Avi Kivity @ 2015-12-01 21:31 ` Glauber Costa 0 siblings, 0 replies; 58+ messages in thread From: Glauber Costa @ 2015-12-01 21:31 UTC (permalink / raw) To: Avi Kivity; +Cc: Brian Foster, xfs > >>> That is, the code called by io_submit() >>> (file_operations::write_iter, it seems to be called today) can kick >>> off this operation and have it continue from where it left off. >> >> This is a problem that people have tried to solve in the past (e.g. >> syslets, etc) where the thread executes until it has to block, and >> then it's handed off to a worker thread/syslet to block and the >> main process returns with EIOCBQUEUED. > > > Yes, I remember that. > >> Basically, you're asking for a real AIO infrastructure to >> be introduced into the kernel, and I think that's beyond what us XFS >> guys can do... > > > Sure you can, Dave. In fact you feel an irresistible urge to do it. What is that? Are you that anxious for the Star Wars premiere that you are trying your very own jedi mind tricks?? > I do now. What's the unicode for a crying face? > >> Hence if you are seeing delays in transactions (e.g. timestamp updates) >> it's entirely possible that things will get much better if you >> remove the discard mount option. It's much better from a performance >> perspective to use the fstrim command every so often - fstrim issues >> discard operations in the context of the fstrim process - it does >> not interact with the transaction subsystem at all. >> >> > > All right. On the other hand we have to know when to issue it. That would > be when nn% of the disk area have been rewritten. Is there some counter I > can poll every minute or so for this? Not doing the fstrim in time would > cause the disk performance to tank. 
Note, as I said, that while this is a really good general recommendation from down under, that was not likely to have had any effect in the current trace - that disk does not support discard, and I am assuming the mount option becomes a noop in this case. 
* Re: sleeps and waits during io_submit 2015-11-30 14:10 ` Brian Foster 2015-11-30 14:29 ` Avi Kivity @ 2015-11-30 15:49 ` Glauber Costa 2015-12-01 13:11 ` Brian Foster 1 sibling, 1 reply; 58+ messages in thread From: Glauber Costa @ 2015-11-30 15:49 UTC (permalink / raw) To: Brian Foster; +Cc: Avi Kivity, xfs Hi Brian >> 1) xfs_buf_lock -> xfs_log_force. >> >> I've started wondering what would make xfs_log_force sleep. But then I >> have noticed that xfs_log_force will only be called when a buffer is >> marked stale. Most of the times a buffer is marked stale seems to be >> due to errors. Although that is not my case (more on that), it got me >> thinking that maybe the right thing to do would be to avoid hitting >> this case altogether? >> > > I'm not following where you get the "only if marked stale" part..? It > certainly looks like that's one potential purpose for the call, but this > is called in a variety of other places as well. E.g., forcing the log > via pushing on the ail when it has pinned items is another case. The ail > push itself can originate from transaction reservation, etc., when log > space is needed. In other words, I'm not sure this is something that's > easily controlled from userspace, if at all. Rather, it's a significant > part of the wider state machine the fs uses to manage logging. I understand that in general xfs_log_force can be called from many places. But in our traces the ones we see sleeping are coming from xfs_buf_lock. The code for xfs_buf_lock reads: if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE)) xfs_log_force(bp->b_target->bt_mount, 0); which if I read correctly, will be called only for stale buffers. True thing they happen to be pinned as well, but somehow the stale part caught my attention. It seemed to me from briefly looking that the stale condition was a more "avoidable" one. 
(keep in mind I am not an awesome XFSer, may be missing something) > >> The file example-stale.txt contains a backtrace of the case where we >> are being marked as stale. It seems to be happening when we convert >> the inode's extents from unwritten to real. Can this case be >> avoided? I won't pretend I know the intricacies of this, but couldn't >> we be keeping extents from the very beginning to avoid creating stale >> buffers? >> > > This is down in xfs_fs_evict_inode()->xfs_inactive(), which is generally > when an inode is evicted from cache. In this case, it looks like the > inode is unlinked (permanently removed), the extents are being removed > and a bmap btree block is being invalidated as part of that overall > process. I don't think this has anything to do with unwritten extents. > Cool. If the inode is indeed unlinked, could that still be triggering that condition in xfs_buf_lock? I am not even close to fully understanding how XFS manages and/or recycles buffers, but it seems to me that if an inode is going away, there isn't really any reason to contend for its buffers. >> 2) xfs_buf_lock -> down >> This is one I truly don't understand. What can be causing contention >> in this lock? We never have two different cores writing to the same >> buffer, nor should we have the same core doing so. >> > > This is not one single lock. An XFS buffer is the data structure used to > modify/log/read-write metadata on-disk and each buffer has its own lock > to prevent corruption. Buffer lock contention is possible because the > filesystem has bits of "global" metadata that has to be updated via > buffers. I see. Since I hate guessing, is there any way you would recommend for us to probe the system to determine if this contention scenario is indeed the one we are seeing? 
We usually open a file, write to it from a single core only, sequentially, direct IO only, as well behavedly as we can, with all the effort in the world to be good kids to the extent Santa will bring us presents without us even asking. So we were very puzzled to see contention. Contention for global metadata updates is the best explanation we've had so far, and would be great if we could verify it is indeed the case. > > For example, usually one has multiple allocation groups to maximize > parallelism, but we still have per-ag metadata that has to be tracked > globally with respect to each AG (e.g., free space trees, inode > allocation trees, etc.). Any operation that affects this metadata (e.g., > block/inode allocation) has to lock the agi/agf buffers along with any > buffers associated with the modified btree leaf/node blocks, etc. > > One example in your attached perf traces has several threads looking to > acquire the AGF, which is a per-AG data structure for tracking free > space in the AG. One thread looks like the inode eviction case noted > above (freeing blocks), another looks like a file truncate (also freeing > blocks), and yet another is a block allocation due to a direct I/O > write. Were any of these operations directed to an inode in a separate > AG, they would be able to proceed in parallel (but I believe they would > still hit the same codepaths as far as perf can tell). This is great, great, awesome info Brian. Thanks. We are so far allocating inodes and truncating them when we need a new one, but maybe there is some allocation pattern that is friendlier to the AG? I understand that with such a data structure it may very well be impossible to get rid of all waiting, but we will certainly do all we can to mitigate it. > >> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time >> >> You guys seem to have an interface to avoid that, by setting the >> FMODE_NOCMTIME flag. 
This is done by issuing the open by handle ioctl, >> which will set this flag for all regular files. That's great, but that >> ioctl required CAP_SYS_ADMIN, which is a big no for us, since we run >> our server as an unprivileged user. I don't understand, however, why >> such a strict check is needed. If we have full rights on the >> filesystem, why can't we issue this operation? In my view, CAP_FOWNER >> should already be enough. I do understand the handles have to be stable >> and a file can have its ownership changed, in which case the previous >> owner would keep the handle valid. Is that the reason you went with >> the most restrictive capability ? > > I'm not familiar enough with the open-by-handle stuff to comment on the > permission constraints. Perhaps Dave or others can comment further on > this bit... > > Brian Thanks again Brian. The pointer to the AG stuff was really helpful. 
* Re: sleeps and waits during io_submit 2015-11-30 15:49 ` Glauber Costa @ 2015-12-01 13:11 ` Brian Foster 2015-12-01 13:39 ` Glauber Costa 0 siblings, 1 reply; 58+ messages in thread From: Brian Foster @ 2015-12-01 13:11 UTC (permalink / raw) To: Glauber Costa; +Cc: Avi Kivity, xfs On Mon, Nov 30, 2015 at 10:49:27AM -0500, Glauber Costa wrote: > Hi Brian > > >> 1) xfs_buf_lock -> xfs_log_force. > >> > >> I've started wondering what would make xfs_log_force sleep. But then I > >> have noticed that xfs_log_force will only be called when a buffer is > >> marked stale. Most of the times a buffer is marked stale seems to be > >> due to errors. Although that is not my case (more on that), it got me > >> thinking that maybe the right thing to do would be to avoid hitting > >> this case altogether? > >> > > > > I'm not following where you get the "only if marked stale" part..? It > > certainly looks like that's one potential purpose for the call, but this > > is called in a variety of other places as well. E.g., forcing the log > > via pushing on the ail when it has pinned items is another case. The ail > > push itself can originate from transaction reservation, etc., when log > > space is needed. In other words, I'm not sure this is something that's > > easily controlled from userspace, if at all. Rather, it's a significant > > part of the wider state machine the fs uses to manage logging. > > I understand that in general xfs_log_force can be called from many > places. But in our traces the ones we see sleeping are coming from > xfs_buf_lock. The code for xfs_buf_lock reads: > > if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE)) > xfs_log_force(bp->b_target->bt_mount, 0); > > > which if I read correctly, will be called only for stale buffers. True > thing they happen to be pinned as well, but somehow the stale part > caught my attention. It seemed to me from briefly looking that the > stale condition was a more "avoidable" one. 
(keep in mind I am not an > awesome XFSer, may be missing something) > It's not really avoidable. It's an expected buffer state when metadata blocks/buffers are freed as actions must be taken if they are reused. Dave's breakdown describes how you might be hitting this based on your traces. > > > >> The file example-stale.txt contains a backtrace of the case where we > >> are being marked as stale. It seems to be happening when we convert > >> the inode's extents from unwritten to real. Can this case be > >> avoided? I won't pretend I know the intricacies of this, but couldn't > >> we be keeping extents from the very beginning to avoid creating stale > >> buffers? > >> > > > > This is down in xfs_fs_evict_inode()->xfs_inactive(), which is generally > > when an inode is evicted from cache. In this case, it looks like the > > inode is unlinked (permanently removed), the extents are being removed > > and a bmap btree block is being invalidated as part of that overall > > process. I don't think this has anything to do with unwritten extents. > > > > Cool. If the inode is indeed unlinked, could that still be triggering > that condition in xfs_buf_lock? I am not even close to fully > understanding how XFS manages and/or recycles buffers, but it seems to > me that if an inode is going away, there isn't really any reason to > contend for its buffers. > I think so.. the inode removal will free various metadata blocks and they could still be in the stale state by the time something else comes along and allocates them (re: Dave's example covers this). > >> 2) xfs_buf_lock -> down > >> This is one I truly don't understand. What can be causing contention > >> in this lock? We never have two different cores writing to the same > >> buffer, nor should we have the same core doing so. > >> > > > > This is not one single lock. An XFS buffer is the data structure used to > > modify/log/read-write metadata on-disk and each buffer has its own lock > > to prevent corruption. 
Buffer lock contention is possible because the > > filesystem has bits of "global" metadata that has to be updated via > > buffers. > > I see. Since I hate guessing, is there any way you would recommend for > us to probe the system to determine if this contention scenario is > indeed the one we are seeing? > I'd probably use perf as you are, I'm just not sure if there's any real way to tell which threads are contending on which AGs. I'm not terribly experienced with perf. I suppose that if the AGF/AGI read/lock traces are high up on the list, the chances are higher you're spending a lot of time waiting on AGs. It's relatively easy to increase the AG count and allocate inodes under separate AGs (see my previous mail) as an experiment to see if such contention is reduced. > We usually open a file, write to it from a single core only, > sequentially, direct IO only, as well behavedly as we can, with all > the effort in the world to be good kids to the extent Santa will bring > us presents without us even asking. > > So we were very puzzled to see contention. Contention for global > metadata updates is the best explanation we've had so far, and would > be great if we could verify it is indeed the case. > > > > > For example, usually one has multiple allocation groups to maximize > > parallelism, but we still have per-ag metadata that has to be tracked > > globally with respect to each AG (e.g., free space trees, inode > > allocation trees, etc.). Any operation that affects this metadata (e.g., > > block/inode allocation) has to lock the agi/agf buffers along with any > > buffers associated with the modified btree leaf/node blocks, etc. > > > > One example in your attached perf traces has several threads looking to > > acquire the AGF, which is a per-AG data structure for tracking free > > space in the AG. 
One thread looks like the inode eviction case noted > > above (freeing blocks), another looks like a file truncate (also freeing > > blocks), and yet another is a block allocation due to a direct I/O > > write. Were any of these operations directed to an inode in a separate > > AG, they would be able to proceed in parallel (but I believe they would > > still hit the same codepaths as far as perf can tell). > > This is great, great, awesome info Brian. Thanks. We are so far > allocating inodes and truncating them when we need a new one, but > maybe there is some allocation pattern that is friendlier to the AG? I > understand that with such a data structure it may very well be > impossible to get rid of all waiting, but we will certainly do all we > can to mitigate it. > The truncate will free blocks and require block allocation on subsequent writes. That might be something you could look into trying to avoid (e.g., keeping files around and reusing space), but that depends on your application design. Inode chunks are allocated and freed dynamically by default as well. The 'ikeep' mount option keeps inode chunks around indefinitely (even if individual inodes are all freed) if you wanted to avoid inode chunk reallocation and know you have a fairly stable working set of inodes. Per-inode extent size hints might be another option to increase the size of allocations and perhaps reduce the number of them. Brian > > > >> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time > >> > >> You guys seem to have an interface to avoid that, by setting the > >> FMODE_NOCMTIME flag. This is done by issuing the open by handle ioctl, > >> which will set this flag for all regular files. That's great, but that > >> ioctl required CAP_SYS_ADMIN, which is a big no for us, since we run > >> our server as an unprivileged user. I don't understand, however, why > >> such a strict check is needed. 
If we have full rights on the > >> filesystem, why can't we issue this operation? In my view, CAP_FOWNER > >> should already be enough. I do understand the handles have to be stable > >> and a file can have its ownership changed, in which case the previous > >> owner would keep the handle valid. Is that the reason you went with > >> the most restrictive capability ? > > > > I'm not familiar enough with the open-by-handle stuff to comment on the > > permission constraints. Perhaps Dave or others can comment further on > > this bit... > > > > Brian > > Thanks again Brian. The pointer to the AG stuff was really helpful. 
* Re: sleeps and waits during io_submit 2015-12-01 13:11 ` Brian Foster @ 2015-12-01 13:39 ` Glauber Costa 2015-12-01 14:02 ` Brian Foster 0 siblings, 1 reply; 58+ messages in thread From: Glauber Costa @ 2015-12-01 13:39 UTC (permalink / raw) To: Brian Foster; +Cc: Avi Kivity, xfs > > The truncate will free blocks and require block allocation on subsequent > writes. That might be something you could look into trying to avoid > (e.g., keeping files around and reusing space), but that depends on your > application design. This one is a bit hard. We have a journal-like structure for the modifications issued to the data store, which dominates most of our write workloads (including this one that I am discussing here). We could keep them around by renaming them outside of user visibility and then renaming them back, but that would mean that we are now using twice as much space. Perhaps we could use a pool that can at least guarantee one or two allocations from a pre-existing file. I am assuming here that renaming the file won't block. If it does, we are better off not doing so. > Inode chunks are allocated and freed dynamically by > default as well. The 'ikeep' mount option keeps inode chunks around > indefinitely (even if individual inodes are all freed) if you wanted to > avoid inode chunk reallocation and know you have a fairly stable working > set of inodes. I believe we do have a fairly stable inode working set, even though that depends a bit on what's considered stable. For our journal-like structure, we will keep them around until we are sure the information is safe and then delete them - creating new ones as we receive more data. But that's always bounded in size. Am I correct to understand that ikeep being passed, new allocations would just reuse space from the empty chunks on disk? > Per-inode extent size hints might be another option to > increase the size of allocations and perhaps reduce the number of them. > That's absolutely greatastic. 
Our files for that journal are all more or less the same size. That's a great candidate for a hint. > Brian Thanks again, Brian 
* Re: sleeps and waits during io_submit 2015-12-01 13:39 ` Glauber Costa @ 2015-12-01 14:02 ` Brian Foster 0 siblings, 0 replies; 58+ messages in thread From: Brian Foster @ 2015-12-01 14:02 UTC (permalink / raw) To: Glauber Costa; +Cc: Avi Kivity, xfs On Tue, Dec 01, 2015 at 08:39:06AM -0500, Glauber Costa wrote: > > > > The truncate will free blocks and require block allocation on subsequent > > writes. That might be something you could look into trying to avoid > > (e.g., keeping files around and reusing space), but that depends on your > > application design. > > > This one is a bit hard. We have a journal-like structure for the > modifications issued to the data store, which dominates most of our > write workloads (including this one that I am discussing here). We > could keep they around by renaming them outside of user visibility and > then renaming them back, but that would mean that we are now using > twice as much space. Perhaps we could use a pool that can at least > guarantee one or two allocations from a pre-existing file. I am > assuming here that renaming the file won't block. If it does, we are > better off not doing so. > > > Inodes chunks are allocated and freed dynamically by > > default as well. The 'ikeep' mount option keeps inode chunks around > > indefinitely (even if individual inodes are all freed) if you wanted to > > avoid inode chunk reallocation and know you have a fairly stable working > > set of inodes. > > I believe we do have a fairly stable inode working set, even though > that depends a bit on what's considered stable. For our journal-like > structure, we will keep them around until we are sure the information > is safe and them delete them - creating new ones as we receive more > data. But that's always bounded in size. > > Am I correct to understand that ikeep being passed, new allocations > would just reuse space from the empty chunks on disk? > Yes.. current behavior is that inodes are allocated and freed in chunks of 64. 
When the entire chunk of inodes is freed from the namespace, the chunk is freed (i.e., it is now free space). With ikeep, inode chunks are never freed. When an individual inode allocation request is made, the inode is allocated from one of the existing inode chunks before a new chunk is allocated. The tradeoff is that you could consume a significant amount of space with inodes, free a bunch of them and that space is not freed. So that is something to be aware of for your use case, particularly if the fs has other uses from your journaling mechanism described above because it affects the entire fs. > > > Per-inode extent size hints might be another option to > > increase the size of allocations and perhaps reduce the number of them. > > > > That's absolutely greatastic. Our files for that journal are all more > or less the same size. That's a great candidate for a hint. > You could consider preallocation (fallocate()) as well if you know the full size in advance. Brian > > Brian > > Thanks again, Brian 
* Re: sleeps and waits during io_submit 2015-11-28 2:43 sleeps and waits during io_submit Glauber Costa 2015-11-30 14:10 ` Brian Foster @ 2015-11-30 23:10 ` Dave Chinner 2015-11-30 23:51 ` Glauber Costa 1 sibling, 1 reply; 58+ messages in thread From: Dave Chinner @ 2015-11-30 23:10 UTC (permalink / raw) To: Glauber Costa; +Cc: Avi Kivity, xfs On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote: > Hello my dear XFSers, > > For those of you who don't know, we at ScyllaDB produce a modern NoSQL > data store that, at the moment, runs on top of XFS only. We deal > exclusively with asynchronous and direct IO, due to our > thread-per-core architecture. Due to that, we avoid issuing any > operation that will sleep. > > While debugging an extreme case of bad performance (most likely > related to a not-so-great disk), I have found a variety of cases in > which XFS blocks. To find those, I have used perf record -e > sched:sched_switch -p <pid_of_db>, and I am attaching the perf report > as xfs-sched_switch.log. Please note that this doesn't tell me for how > long we block, but as mentioned before, blocking operations outside > our control are detrimental to us regardless of the elapsed time. > > For those who are not acquainted to our internals, please ignore > everything in that file but the xfs functions. For the xfs symbols, > there are two kinds of events: the ones that are a children of > io_submit, where we don't tolerate blocking, and the ones that are > children of our helper IO thread, to where we push big operations that > we know will block until we can get rid of them all. We care about the > former and ignore the latter. > > Please allow me to ask you a couple of questions about those findings. > If we are doing anything wrong, advise on best practices is truly > welcome. > > 1) xfs_buf_lock -> xfs_log_force. > > I've started wondering what would make xfs_log_force sleep. 
But then I > have noticed that xfs_log_force will only be called when a buffer is > marked stale. Most of the times a buffer is marked stale seems to be > due to errors. Although that is not my case (more on that), it got me > thinking that maybe the right thing to do would be to avoid hitting > this case altogether? The buffer is stale because it has recently been freed, and we cannot re-use it until the transaction that freed it has been committed to the journal. e.g. this trace: --3.15%-- _xfs_log_force xfs_log_force xfs_buf_lock _xfs_buf_find xfs_buf_get_map xfs_trans_get_buf_map xfs_btree_get_bufl xfs_bmap_extents_to_btree xfs_bmap_add_extent_hole_real xfs_bmapi_write xfs_iomap_write_direct __xfs_get_blocks xfs_get_blocks_direct do_blockdev_direct_IO __blockdev_direct_IO xfs_vm_direct_IO xfs_file_dio_aio_write xfs_file_write_iter aio_run_iocb do_io_submit sys_io_submit entry_SYSCALL_64_fastpath io_submit 0x46d98a implies something like this has happened: truncate free extent extent list now fits inline in inode btree-to-extent format change free last bmbt block X mark block contents stale add block to busy extent list place block on AGFL AIO write allocate extent inline extents full extent-to-btree conversion allocate bmbt block grab block X from free list get locked buffer for block X xfs_buf_lock buffer stale log force to commit previous free transaction to disk ..... log force completes buffer removed from busy extent list buffer no longer stale add bmbt record to new block update btree indexes .... And, looking at the trace you attached, we see this: dump_stack xfs_buf_stale xfs_trans_binval xfs_bmap_btree_to_extents <<<<<<<<< xfs_bunmapi xfs_itruncate_extents So I'd say this is a pretty clear indication that we're immediately recycling freed blocks from the free list here.... > The file example-stale.txt contains a backtrace of the case where we > are being marked as stale. 
It seems to be happening when we convert > the inode's extents from unwritten to real. Can this case be > avoided? I won't pretend I know the intricacies of this, but couldn't > we be keeping extents from the very beginning to avoid creating stale > buffers? > > 2) xfs_buf_lock -> down > This is one I truly don't understand. What can be causing contention > in this lock? We never have two different cores writing to the same > buffer, nor should we have the same core doing so. As Brian pointed out, this is probably AGF or AGI contention - attempting to allocate/free extents or inodes in the same AG at the same time will show this sort of pattern. This trace shows AGF contention: down xfs_buf_lock _xfs_buf_find xfs_buf_get_map xfs_buf_read_map xfs_trans_read_buf_map xfs_read_agf ..... This trace shows AGI contention: down xfs_buf_lock _xfs_buf_find xfs_buf_get_map xfs_buf_read_map xfs_trans_read_buf_map xfs_read_agi .... > 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time This trace?: rwsem_down_write_failed call_rwsem_down_write_failed xfs_ilock xfs_vn_update_time file_update_time xfs_file_aio_write_checks xfs_file_dio_aio_write xfs_file_write_iter aio_run_iocb do_io_submit sys_io_submit entry_SYSCALL_64_fastpath io_submit 0x46d98a Which is an mtime timestamp update racing with another operation that takes the internal metadata lock (e.g. block mapping/allocation for that inode). > You guys seem to have an interface to avoid that, by setting the > FMODE_NOCMTIME flag. We've talked about exposing this through open() for Ceph. http://www.kernelhub.org/?p=2&msg=744325 https://lkml.org/lkml/2015/5/15/671 Read the first thread for why it's problematic to expose this to userspace - I won't repeat it all here. As it is, there was some code recently hacked into ext4 to reduce mtime overhead - the MS_LAZYTIME superblock option. 
What it does is prevent the inode from being marked dirty when timestamps are updated and hence the timestamps are never journalled or written until something else marks the inode metadata dirty (e.g. block allocation). ext4 gets away with this because it doesn't actually journal timestamp changes - they get captured in the journal by other modifications that are journalled, but still rely on the inode being marked dirty for fsync, writeback and inode cache eviction doing the right thing. The ext4 implementation looks like the timestamp updates can be thrown away, as the inodes are not marked dirty and so on memory pressure they will simply be reclaimed without writing back the updated timestamps that are held in memory. I suspect fsync will also have problems on ext4 as the inode is not metadata dirty or journalled, and hence the timestamp changes will never get written back. And, IMO, the worst part of the ext4 implementation is that the inode buffer writeback code in ext4 now checks to see if any of the other inodes in the buffer being written back need to have their inode timestamps updated. IOWs, ext4 now does writeback of /unjournalled metadata/ to inodes that are purely timestamp dirty. We /used/ to do shit like this in XFS. We got rid of it in preference of journalling everything because the corner cases in log recovery meant that after a crash the inodes were in inconsistent states, and that meant we had unexpected, unpredictable recovery behaviour where files weren't the expected size and/or didn't contain the expected data. Hence going back to the bad old days of hacking around the journal "for speed" doesn't exactly fill me with joy. Let me have a think about how we can implement lazytime in a sane way, such that fsync() works correctly, we don't throw away timestamp changes in memory reclaim and we don't write unlogged changes to the on-disk locations.... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com 
* Re: sleeps and waits during io_submit 2015-11-30 23:10 ` Dave Chinner @ 2015-11-30 23:51 ` Glauber Costa 2015-12-01 20:30 ` Dave Chinner 0 siblings, 1 reply; 58+ messages in thread From: Glauber Costa @ 2015-11-30 23:51 UTC (permalink / raw) To: Dave Chinner; +Cc: Avi Kivity, xfs Hi Dave On Mon, Nov 30, 2015 at 6:10 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Nov 27, 2015 at 09:43:50PM -0500, Glauber Costa wrote: >> Hello my dear XFSers, >> >> For those of you who don't know, we at ScyllaDB produce a modern NoSQL >> data store that, at the moment, runs on top of XFS only. We deal >> exclusively with asynchronous and direct IO, due to our >> thread-per-core architecture. Due to that, we avoid issuing any >> operation that will sleep. >> >> While debugging an extreme case of bad performance (most likely >> related to a not-so-great disk), I have found a variety of cases in >> which XFS blocks. To find those, I have used perf record -e >> sched:sched_switch -p <pid_of_db>, and I am attaching the perf report >> as xfs-sched_switch.log. Please note that this doesn't tell me for how >> long we block, but as mentioned before, blocking operations outside >> our control are detrimental to us regardless of the elapsed time. >> >> For those who are not acquainted to our internals, please ignore >> everything in that file but the xfs functions. For the xfs symbols, >> there are two kinds of events: the ones that are a children of >> io_submit, where we don't tolerate blocking, and the ones that are >> children of our helper IO thread, to where we push big operations that >> we know will block until we can get rid of them all. We care about the >> former and ignore the latter. >> >> Please allow me to ask you a couple of questions about those findings. >> If we are doing anything wrong, advise on best practices is truly >> welcome. >> >> 1) xfs_buf_lock -> xfs_log_force. >> >> I've started wondering what would make xfs_log_force sleep. 
But then I >> have noticed that xfs_log_force will only be called when a buffer is >> marked stale. Most of the times a buffer is marked stale seems to be >> due to errors. Although that is not my case (more on that), it got me >> thinking that maybe the right thing to do would be to avoid hitting >> this case altogether? > > The buffer is stale because it has recently been freed, and we > cannot re-use it until the transaction that freed it has been > committed to the journal. e.g. this trace: > > --3.15%-- _xfs_log_force > xfs_log_force > xfs_buf_lock > _xfs_buf_find > xfs_buf_get_map > xfs_trans_get_buf_map > xfs_btree_get_bufl > xfs_bmap_extents_to_btree > xfs_bmap_add_extent_hole_real > xfs_bmapi_write > xfs_iomap_write_direct > __xfs_get_blocks > xfs_get_blocks_direct > do_blockdev_direct_IO > __blockdev_direct_IO > xfs_vm_direct_IO > xfs_file_dio_aio_write > xfs_file_write_iter > aio_run_iocb > do_io_submit > sys_io_submit > entry_SYSCALL_64_fastpath > io_submit > 0x46d98a > > implies something like this has happened: > > truncate > free extent > extent list now fits inline in inode > btree-to-extent format change > free last bmbt block X > mark block contents stale > add block to busy extent list > place block on AGFL > > AIO write > allocate extent > inline extents full > extent-to-btree conversion > allocate bmbt block > grab block X from free list > get locked buffer for block X > xfs_buf_lock > buffer stale > log force to commit previous free transaction to disk > ..... > log force completes > buffer removed from busy extent list > buffer no longer stale > add bmbt record to new block > update btree indexes > .... > > > And, looking at the trace you attached, we see this: > > dump_stack > xfs_buf_stale > xfs_trans_binval > xfs_bmap_btree_to_extents <<<<<<<<< > xfs_bunmapi > xfs_itruncate_extents > > So I'd say this is a pretty clear indication that we're immediately > recycling freed blocks from the free list here.... 
> >> The file example-stale.txt contains a backtrace of the case where we >> are being marked as stale. It seems to be happening when we convert >> the inode's extents from unwritten to real. Can this case be >> avoided? I won't pretend I know the intricacies of this, but couldn't >> we be keeping extents from the very beginning to avoid creating stale >> buffers? >> >> 2) xfs_buf_lock -> down >> This is one I truly don't understand. What can be causing contention >> in this lock? We never have two different cores writing to the same >> buffer, nor should we have the same core doing so. > > As Brian pointed out, this is probably AGF or AGI contention - > attempting to allocate/free extents or inodes in the same AG at the > same time will show this sort of pattern. This trace shows AGF > contention: > > down > xfs_buf_lock > _xfs_buf_find > xfs_buf_get_map > xfs_buf_read_map > xfs_trans_read_buf_map > xfs_read_agf > ..... > > This trace shows AGI contention: > > down > xfs_buf_lock > _xfs_buf_find > xfs_buf_get_map > xfs_buf_read_map > xfs_trans_read_buf_map > xfs_read_agi > .... > Great. I will take a look at how we can mitigate those on our side. I will need some time to understand all of that better, so for now I'd just leave you guys with a big thank you. >> 3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time > > This trace?: > > rwsem_down_write_failed > call_rwsem_down_write_failed > xfs_ilock > xfs_vn_update_time > file_update_time > xfs_file_aio_write_checks > xfs_file_dio_aio_write > xfs_file_write_iter > aio_run_iocb > do_io_submit > sys_io_submit > entry_SYSCALL_64_fastpath > io_submit > 0x46d98a > > Which is an mtime timestamp update racing with another operation > that takes the internal metadata lock (e.g. block mapping/allocation > for that inode). > >> You guys seem to have an interface to avoid that, by setting the >> FMODE_NOCMTIME flag. > > We've talked about exposing this through open() for Ceph.
> > http://www.kernelhub.org/?p=2&msg=744325 > https://lkml.org/lkml/2015/5/15/671 > > Read the first thread for why it's problematic to expose this to > userspace - I won't repeat it all here. > > As it is, there was some code recently hacked into ext4 to reduce > mtime overhead - the MS_LAZYTIME superblock option. What it does is > prevent the inode from being marked dirty when timestamps are > updated and hence the timestamps are never journalled or written > until something else marks the inode metadata dirty (e.g. block > allocation). ext4 gets away with this because it doesn't actually > journal timestamp changes - they get captured in the journal by > other modifications that are journalled, but still rely on the inode > being marked dirty for fsync, writeback and inode cache eviction > doing the right thing. > > The ext4 implementation looks like the timestamp updates can be > thrown away, as the inodes are not marked dirty and so on memory > pressure they will simply be reclaimed without writing back the > updated timestamps that are held in memory. I suspect fsync will > also have problems on ext4 as the inode is not metadata dirty or > journalled, and hence the timestamp changes will never get written > back. > > And, IMO, the worst part of the ext4 implementation is that the > inode buffer writeback code in ext4 now checks to see if any of the > other inodes in the buffer being written back need to have their > inode timestamps updated. IOWs, ext4 now does writeback of > /unjournalled metadata/ to inodes that are purely timestamp dirty. > > We /used/ to do shit like this in XFS. We got rid of it in > preference of journalling everything because the corner cases in log > recovery meant that after a crash the inodes were in inconsistent > states, and that meant we had unexpected, unpredictable recovery > behaviour where files weren't the expected size and/or didn't > contain the expected data.
Hence going back to the bad old days of > hacking around the journal "for speed" doesn't exactly fill me with > joy. > > Let me have a think about how we can implement lazytime in a sane > way, such that fsync() works correctly, we don't throw away > timestamp changes in memory reclaim and we don't write unlogged > changes to the on-disk locations.... I trust you fully for matters related to speed. Keep in mind, though, that at least for us the fact that it blocks is a lot worse than the fact that it is slow. We can work around slow, but blocking basically means that we won't have any more work to push - since we don't do threading. The processor that stalls just sits idle until the lock is released. So any non-blocking solution to this would already be a win for us. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
* Re: sleeps and waits during io_submit 2015-11-30 23:51 ` Glauber Costa @ 2015-12-01 20:30 ` Dave Chinner 0 siblings, 0 replies; 58+ messages in thread From: Dave Chinner @ 2015-12-01 20:30 UTC (permalink / raw) To: Glauber Costa; +Cc: Avi Kivity, xfs On Mon, Nov 30, 2015 at 06:51:51PM -0500, Glauber Costa wrote: > On Mon, Nov 30, 2015 at 6:10 PM, Dave Chinner <david@fromorbit.com> wrote: > > Let me have a think about how we can implement lazytime in a sane > > way, such that fsync() works correctly, we don't throw away > > timestamp changes in memory reclaim and we don't write unlogged > > changes to the on-disk locations.... > > I trust you fully for matters related to speed. > > Keep in mind, though, that at least for us the fact that it blocks is > a lot worse than the fact that it is slow. We can work around slow, > but blocking basically means that we won't have any more work to push > - since we don't do threading. The processor that stalls just sits > idle until the lock is released. So any non-blocking solution to this > would already be a win for us. Right, the blocking is on the inode lock needed to do the transactional update of the timestamp. lazytime would need to avoid the timestamp update transaction completely, but we still need to capture the timestamp and run the transaction later or capture it in a subsequent change before we write back the inode. Cheers, Dave. -- Dave Chinner david@fromorbit.com
end of thread, other threads:[~2015-12-09 8:37 UTC | newest] Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-11-28 2:43 sleeps and waits during io_submit Glauber Costa 2015-11-30 14:10 ` Brian Foster 2015-11-30 14:29 ` Avi Kivity 2015-11-30 16:14 ` Brian Foster 2015-12-01 9:08 ` Avi Kivity 2015-12-01 13:11 ` Brian Foster 2015-12-01 13:58 ` Avi Kivity 2015-12-01 14:01 ` Glauber Costa 2015-12-01 14:37 ` Avi Kivity 2015-12-01 20:45 ` Dave Chinner 2015-12-01 20:56 ` Avi Kivity 2015-12-01 23:41 ` Dave Chinner 2015-12-02 8:23 ` Avi Kivity 2015-12-01 14:56 ` Brian Foster 2015-12-01 15:22 ` Avi Kivity 2015-12-01 16:01 ` Brian Foster 2015-12-01 16:08 ` Avi Kivity 2015-12-01 16:29 ` Brian Foster 2015-12-01 17:09 ` Avi Kivity 2015-12-01 18:03 ` Carlos Maiolino 2015-12-01 19:07 ` Avi Kivity 2015-12-01 21:19 ` Dave Chinner 2015-12-01 21:38 ` Avi Kivity 2015-12-01 23:06 ` Dave Chinner 2015-12-02 9:02 ` Avi Kivity 2015-12-02 12:57 ` Carlos Maiolino 2015-12-02 23:19 ` Dave Chinner 2015-12-03 12:52 ` Avi Kivity 2015-12-04 3:16 ` Dave Chinner 2015-12-08 13:52 ` Avi Kivity 2015-12-08 23:13 ` Dave Chinner 2015-12-01 18:51 ` Brian Foster 2015-12-01 19:07 ` Glauber Costa 2015-12-01 19:35 ` Brian Foster 2015-12-01 19:45 ` Avi Kivity 2015-12-01 19:26 ` Avi Kivity 2015-12-01 19:41 ` Christoph Hellwig 2015-12-01 19:50 ` Avi Kivity 2015-12-02 0:13 ` Brian Foster 2015-12-02 0:57 ` Dave Chinner 2015-12-02 8:38 ` Avi Kivity 2015-12-02 8:34 ` Avi Kivity 2015-12-08 6:03 ` Dave Chinner 2015-12-08 13:56 ` Avi Kivity 2015-12-08 23:32 ` Dave Chinner 2015-12-09 8:37 ` Avi Kivity 2015-12-01 21:04 ` Dave Chinner 2015-12-01 21:10 ` Glauber Costa 2015-12-01 21:39 ` Dave Chinner 2015-12-01 21:24 ` Avi Kivity 2015-12-01 21:31 ` Glauber Costa 2015-11-30 15:49 ` Glauber Costa 2015-12-01 13:11 ` Brian Foster 2015-12-01 13:39 ` Glauber Costa 2015-12-01 14:02 ` Brian Foster 2015-11-30 23:10 ` Dave Chinner 2015-11-30 23:51 ` Glauber 
Costa 2015-12-01 20:30 ` Dave Chinner