* High memory usage kills OSD while peering
@ 2017-08-17 14:13 Linux Chips
  2017-08-17 17:53 ` Gregory Farnum
  0 siblings, 1 reply; 27+ messages in thread
From: Linux Chips @ 2017-08-17 14:13 UTC (permalink / raw)
  To: ceph-devel

Hello everybody,
I have a Kraken cluster with 660 OSDs. It is currently down because it
cannot complete peering: OSDs start consuming lots of memory, draining
the system and killing the node, so I set a memory limit on the OSD
service (28G on some OSDs and as high as 35G on others) so they get
killed before taking down the whole node.
Peering still does not complete. One OSD entering the cluster (with
about 300 already up) drives the memory usage of most other OSDs very
high (15G+, some as much as 30G) and sometimes kills them when they
reach the service limit, which causes a spiral where all the OSDs end
up consuming all the available memory.
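For anyone wondering how the limit is applied: it is plain systemd
resource control on the OSD units. The OSD id and the 28G value below
are only examples of the idea:

   # cap a single OSD
   systemctl set-property ceph-osd@12.service MemoryLimit=28G

   # or cap every OSD on the node via a drop-in
   mkdir -p /etc/systemd/system/ceph-osd@.service.d
   cat > /etc/systemd/system/ceph-osd@.service.d/memory.conf <<EOF
   [Service]
   MemoryLimit=28G
   EOF
   systemctl daemon-reload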

I found this thread with similar symptoms:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html

That thread asks for a stack trace. I have a 14G core dump; we
generated it by running the OSD from the terminal, enabling core dumps,
and setting the ulimit to 15G. What kind of trace would be useful? All
threads? Is there a better way to debug this?
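In case the details matter, this is roughly how the core was produced
and how we would pull backtraces out of it; the OSD id and file names
are only examples:

   # run one OSD in the foreground with core dumps enabled
   ulimit -c unlimited
   ulimit -v 15728640      # ~15G address-space cap (value in KB)
   ceph-osd -f -i 12

   # after it aborts, dump a backtrace of every thread from the core
   # (the core file name/location depends on kernel.core_pattern)
   gdb /usr/bin/ceph-osd ./core -batch -ex "thread apply all bt" > osd.12.threads.txt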

What can I do to make it work? Is this memory allocation normal?

Some info about the cluster:
41 HDD nodes with 12 x 4TB OSDs each (5 of the nodes have 8TB disks),
324 GB RAM, and dual-socket Intel Xeon CPUs.
7 nodes with 24 x 400GB SSDs, 256 GB RAM, and dual-socket CPUs.
3 monitors.

All nodes have dual 10GbE, except the monitors, which are on dual 1GbE.

All nodes run CentOS 7.2.
It is an old cluster that has been upgraded continuously for the past 3
years. The cluster was on Jewel when the issue happened, triggered by
some accidental OSD map changes that caused heavy recovery on the
cluster. We then upgraded to Kraken in the hope of a smaller memory
footprint.

Any advice on how to proceed?

Thanks in advance



* Re: High memory usage kills OSD while peering
  2017-08-17 14:13 High memory usage kills OSD while peering Linux Chips
@ 2017-08-17 17:53 ` Gregory Farnum
  2017-08-17 18:51   ` Linux Chips
  0 siblings, 1 reply; 27+ messages in thread
From: Gregory Farnum @ 2017-08-17 17:53 UTC (permalink / raw)
  To: Linux Chips; +Cc: ceph-devel

On Thu, Aug 17, 2017 at 7:13 AM, Linux Chips <linux.chips@gmail.com> wrote:
> Hello everybody,
> I have Kraken cluster with 660 OSD, currently it is down due to not
> being able to complete peering, OSDs start consuming lots of memory
> draining the system and killing the node, so I set a limit on the OSD
> service (on some OSDs 28G and others as high as 35G), so they get
> killed before taking down the whole node.
> Now I still can't peer, one OSD entering the cluster (with about 300
> already up) makes memory usage of most other OSDs so high (15G+, some as
> much as 30G) and
> sometimes kills them when they reach the service limit. which cause a spiral
> load and causing all the OSDs to consume all the available.
>
> I found this thread with similar symptoms:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html
>
> with a request for stack trace, I have a 14G core dump, we generated it by
> running the osd from the terminal, enabling the core dumps, and setting
> ulimits to 15G. what kind of a trace would be useful? all thread?! any
> better way to debug this?
>
> What can I do do make it work, is this memory allocation normal?
>
> some info about the cluster:
> 41 hdd nodes with 12 x 4TB osd each, 5 of the nodes have 8TB disks. 324 GB
> RAM and dula socket intel xeon.
> 7 nodes with 400GB x 24 ssd and 256GB RAM, and dual socket cpu.
> 3 monitors
>
> all dual 10GB ethernet, except for the monitor with dual 1GB ethers.
>
> all nodes running centos 7.2
> it is an old cluster that was upgraded continuously for the past 3 years.
> the cluster was on jewel when the issue happened due to some accidental OSD
> map changes, causing a heavy recovery operations on the cluster. then we
> upgraded to kraken in the hope of less memory foot prints.
>
> any advice on how to proceed?

It's not normal but if something really bad happened to your cluster,
it's been known to occur. You should go through the troubleshooting
guides at docs.ceph.com, but the general strategy is to set
nodown/noout/etc flags, undo whatever horrible thing you tried to make
the map do, and then turn all the OSDs back on.
-Greg


* Re: High memory usage kills OSD while peering
  2017-08-17 17:53 ` Gregory Farnum
@ 2017-08-17 18:51   ` Linux Chips
  2017-08-19 16:38     ` Mustafa Muhammad
                       ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Linux Chips @ 2017-08-17 18:51 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel



On 08/17/2017 08:53 PM, Gregory Farnum wrote:
> On Thu, Aug 17, 2017 at 7:13 AM, Linux Chips <linux.chips@gmail.com> wrote:
>> Hello everybody,
>> I have Kraken cluster with 660 OSD, currently it is down due to not
>> being able to complete peering, OSDs start consuming lots of memory
>> draining the system and killing the node, so I set a limit on the OSD
>> service (on some OSDs 28G and others as high as 35G), so they get
>> killed before taking down the whole node.
>> Now I still can't peer, one OSD entering the cluster (with about 300
>> already up) makes memory usage of most other OSDs so high (15G+, some as
>> much as 30G) and
>> sometimes kills them when they reach the service limit. which cause a spiral
>> load and causing all the OSDs to consume all the available.
>>
>> I found this thread with similar symptoms:
>>
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html
>>
>> with a request for stack trace, I have a 14G core dump, we generated it by
>> running the osd from the terminal, enabling the core dumps, and setting
>> ulimits to 15G. what kind of a trace would be useful? all thread?! any
>> better way to debug this?
>>
>> What can I do do make it work, is this memory allocation normal?
>>
>> some info about the cluster:
>> 41 hdd nodes with 12 x 4TB osd each, 5 of the nodes have 8TB disks. 324 GB
>> RAM and dula socket intel xeon.
>> 7 nodes with 400GB x 24 ssd and 256GB RAM, and dual socket cpu.
>> 3 monitors
>>
>> all dual 10GB ethernet, except for the monitor with dual 1GB ethers.
>>
>> all nodes running centos 7.2
>> it is an old cluster that was upgraded continuously for the past 3 years.
>> the cluster was on jewel when the issue happened due to some accidental OSD
>> map changes, causing a heavy recovery operations on the cluster. then we
>> upgraded to kraken in the hope of less memory foot prints.
>>
>> any advice on how to proceed?
> It's not normal but if something really bad happened to your cluster,
> it's been known to occur. You should go through the troubleshooting
> guides at docs.ceph.com, but the general strategy is to set
> nodown/noout/etc flags, undo whatever horrible thing you tried to make
> the map do, and then turn all the OSDs back on.
> -Greg

Hi,
we have been trying this for the past week, and it keeps consuming the RAM.
We got the map back to its original state, set all the flags, and
started all the OSDs. Then "ceph osd unset noup", waited 5 minutes, and
all the OSDs were killed by the OOM killer.
We tried one node at a time, letting it finish recovering before
starting the next. We got to a point where starting the next node got
everything killed.
We tried one OSD at a time, same result: one OSD up, ~40 killed by OOM,
and from there it snowballs until all of the active OSDs get killed.
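To be concrete, each attempt went roughly like this (the exact flag
list varied a bit between tries, so treat it as a sketch):

   ceph osd set noup
   ceph osd set nodown
   ceph osd set noout
   ceph osd set norecover
   ceph osd set nobackfill
   # start the ceph-osd services (whole cluster, one node, or a single
   # OSD, depending on the attempt), then let them join:
   ceph osd unset noup
   # ...and a few minutes later the OOM killer starts taking OSDs down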

I think all the up/down churn we generated has increased the recovery
work too much. By the way, we stopped all clients, and we also have
some not-so-friendly erasure-coded pools. Some OSDs now report loading
as many as 800 PGs, while we originally had about 300-400 (I know that
is too many, but we were trying to fix it and... well, we could not).
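For anyone comparing with their own cluster, the per-OSD counts can be
read from the PGS column of "ceph osd df" (at least on recent releases):

   ceph osd df        # PGS column = PGs currently held by each OSD
   ceph osd df tree   # same, grouped by host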

We did memory profiling on one of the OSDs; here are the results:


  12878.6  47.6%  47.6%  12878.6  47.6% std::_Rb_tree::_M_create_node
  12867.6  47.6%  95.2%  25746.2  95.2% std::_Rb_tree::_M_copy
    532.4   2.0%  97.2%    686.3   2.5% OSD::heartbeat
    122.8   0.5%  97.7%    122.8   0.5% std::_Rb_tree::_M_emplace_hint_unique
    121.9   0.5%  98.1%    171.1   0.6% AsyncConnection::send_message
    104.2   0.4%  98.5%    104.2   0.4% ceph::buffer::list::append@c4a770
     99.7   0.4%  98.9%     99.7   0.4% std::vector::_M_default_append
     99.6   0.4%  99.2%     99.6   0.4% ceph::logging::Log::create_entry
     72.6   0.3%  99.5%     72.6   0.3% ceph::buffer::create_aligned
     52.4   0.2%  99.7%     52.5   0.2% std::vector::_M_emplace_back_aux
     23.9   0.1%  99.8%     57.8   0.2% OSD::do_notifies
     17.0   0.1%  99.8%     23.1   0.1% OSDService::build_incremental_map_msg
      9.8   0.0%  99.9%    222.5   0.8% std::enable_if::type decode
      6.2   0.0%  99.9%      6.3   0.0% std::map::operator[]
      5.5   0.0%  99.9%      5.5   0.0% std::vector::vector
      3.5   0.0%  99.9%      3.5   0.0% EventCenter::create_time_event
      2.5   0.0%  99.9%      2.5   0.0% AsyncConnection::AsyncConnection
      2.4   0.0% 100.0%      2.4   0.0% std::string::_Rep::_S_create
      1.5   0.0% 100.0%      1.5   0.0% std::_Rb_tree::_M_insert_unique
      1.4   0.0% 100.0%      1.4   0.0% std::list::operator=
      1.3   0.0% 100.0%      1.3   0.0% ceph::buffer::list::list
      0.9   0.0% 100.0%    204.1   0.8% decode_message
      0.8   0.0% 100.0%      0.8   0.0% OSD::send_failures
      0.7   0.0% 100.0%      0.9   0.0% void decode
      0.6   0.0% 100.0%      0.6   0.0% std::_Rb_tree::_M_insert_equal
      0.6   0.0% 100.0%      2.5   0.0% PG::queue_null
      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::create_connect
      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::add_accept
      0.5   0.0% 100.0%      0.5   0.0% boost::statechart::event::clone
      0.4   0.0% 100.0%      0.4   0.0% PG::queue_peering_event
      0.3   0.0% 100.0%      0.3   0.0% OSD::PeeringWQ::_enqueue
      0.3   0.0% 100.0%    148.6   0.5% OSD::_dispatch
      0.1   0.0% 100.0%    147.9   0.5% OSD::handle_osd_map
      0.1   0.0% 100.0%      0.1   0.0% std::deque::_M_push_back_aux
      0.1   0.0% 100.0%      0.2   0.0% SharedLRU::add
      0.1   0.0% 100.0%      0.1   0.0% OSD::PeeringWQ::_dequeue
      0.1   0.0% 100.0%      0.1   0.0% ceph::buffer::list::append@c4a9b0
      0.1   0.0% 100.0%      0.2   0.0% DispatchQueue::enqueue
      0.1   0.0% 100.0%    283.5   1.0% EventCenter::process_events
      0.1   0.0% 100.0%      0.1   0.0% HitSet::Params::create_impl
      0.1   0.0% 100.0%      0.1   0.0% SimpleLRU::clear_pinned
      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_
      0.0   0.0% 100.0%      0.2   0.0% TrackedOp::mark_event
      0.0   0.0% 100.0%      0.0   0.0% OSD::create_context
      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_allocate_node
      0.0   0.0% 100.0%      0.0   0.0% OSDMap::OSDMap
      0.0   0.0% 100.0%    281.6   1.0% AsyncConnection::process
      0.0   0.0% 100.0%  25802.4  95.4% PG::RecoveryState::RecoveryMachine::send_notify
      0.0   0.0% 100.0%      0.0   0.0% SharedLRU::lru_add
      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_unique_
      0.0   0.0% 100.0%      0.1   0.0% OpTracker::unregister_inflight_op
      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_verify_authorizer
      0.0   0.0% 100.0%      0.0   0.0% OSDService::_add_map
      0.0   0.0% 100.0%      0.1   0.0% OSD::wait_for_new_map
      0.0   0.0% 100.0%      0.5   0.0% OSD::handle_pg_notify
      0.0   0.0% 100.0%      0.0   0.0% std::__shared_count::__shared_count
      0.0   0.0% 100.0%      0.0   0.0% std::__shared_ptr::reset
      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b84080
      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_emplace_unique
      0.0   0.0% 100.0%      0.0   0.0% std::vector::operator=
      0.0   0.0% 100.0%      0.0   0.0% MonClient::_renew_subs
      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_emplace
      0.0   0.0% 100.0%      0.0   0.0% PORT_Alloc_Util
      0.0   0.0% 100.0%      0.0   0.0% CryptoAES::get_key_handler
      0.0   0.0% 100.0%      0.0   0.0% get_auth_session_handler
      0.0   0.0% 100.0%      0.0   0.0% PosixWorker::connect
      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a440
      0.0   0.0% 100.0%      0.0   0.0% std::vector::_M_fill_insert
      0.0   0.0% 100.0%      4.8   0.0% AsyncConnection::fault
      0.0   0.0% 100.0%      0.0   0.0% OSD::send_pg_stats
      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::accept_conn
      0.0   0.0% 100.0%      0.0   0.0% PosixServerSocketImpl::accept
      0.0   0.0% 100.0%      9.3   0.0% AsyncConnection::_process_connection
      0.0   0.0% 100.0%      0.2   0.0% FileStore::lfn_open
      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a350
      0.0   0.0% 100.0%      0.0   0.0% crush_create
      0.0   0.0% 100.0%      0.1   0.0% MgrClient::send_report
      0.0   0.0% 100.0%      0.0   0.0% WBThrottle::queue_wb
      0.0   0.0% 100.0%      0.2   0.0% LogClient::_get_mon_log_message
      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::_set_secret
      0.0   0.0% 100.0%      0.0   0.0% std::_Deque_base::_M_initialize_map
      0.0   0.0% 100.0%      0.1   0.0% ThreadPool::BatchWorkQueue::_void_dequeue
      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@ba6a50
      0.0   0.0% 100.0%      0.0   0.0% MonClient::schedule_tick
      0.0   0.0% 100.0%      0.1   0.0% OSD::tick
      0.0   0.0% 100.0%     37.6   0.1% OSD::tick_without_osd_lock
      0.0   0.0% 100.0%      0.0   0.0% boost::spirit::classic::impl::get_definition
      0.0   0.0% 100.0%      9.4   0.0% MonClient::_send_mon_message
      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_refused
      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_command
      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_accept
      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_connect
      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_stop
      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::accept
      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::handle_connect_msg
      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::mark_down
      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::prepare_send_message
      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_bulk
      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_until
      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::send_keepalive
      0.0   0.0% 100.0%      3.3   0.0% AsyncConnection::wakeup_from
      0.0   0.0% 100.0%      1.8   0.0% AsyncMessenger::get_connection
      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::reap_dead
      0.0   0.0% 100.0%      2.5   0.0% C_OnMapCommit::finish
      0.0   0.0% 100.0%      0.0   0.0% CephXTicketHandler::verify_service_ticket_reply
      0.0   0.0% 100.0%      0.0   0.0% CephXTicketManager::verify_service_ticket_reply
      0.0   0.0% 100.0%      0.0   0.0% CephxAuthorizeHandler::verify_authorizer
      0.0   0.0% 100.0%      0.0   0.0% CephxClientHandler::handle_response
      0.0   0.0% 100.0%     40.9   0.2% Context::complete
      0.0   0.0% 100.0%      4.8   0.0% CrushWrapper::encode
      0.0   0.0% 100.0%      0.0   0.0% CryptoAESKeyHandler::decrypt
      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::decode
      0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::DispatchThread::entry
      0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::entry
      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::fast_dispatch
      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::pre_dispatch
      0.0   0.0% 100.0%      0.0   0.0% EntityName::set
      0.0   0.0% 100.0%      0.0   0.0% EpollDriver::event_wait
      0.0   0.0% 100.0%      3.0   0.0% EventCenter::dispatch_event_external
      0.0   0.0% 100.0%      3.3   0.0% EventCenter::process_time_events
      0.0   0.0% 100.0%      3.0   0.0% EventCenter::wakeup
      0.0   0.0% 100.0%      0.0   0.0% FileJournal::prepare_entry
      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_op
      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transaction
      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transactions
      0.0   0.0% 100.0%      0.0   0.0% FileStore::_journaled_ahead
      0.0   0.0% 100.0%      0.2   0.0% FileStore::_write
      0.0   0.0% 100.0%      0.0   0.0% FileStore::queue_transactions
      0.0   0.0% 100.0%      2.6   0.0% Finisher::finisher_thread_entry
      0.0   0.0% 100.0%      0.1   0.0% FunctionContext::finish
      0.0   0.0% 100.0%      0.1   0.0% HitSet::Params::decode
      0.0   0.0% 100.0%      0.2   0.0% LogChannel::do_log@a90a00
      0.0   0.0% 100.0%      0.3   0.0% LogChannel::do_log@a91030
      0.0   0.0% 100.0%      0.2   0.0% LogClient::get_mon_log_message
      0.0   0.0% 100.0%      0.0   0.0% LogClient::handle_log_ack
      0.0   0.0% 100.0%      0.1   0.0% LogClient::queue
      0.0   0.0% 100.0%      0.3   0.0% LogClientTemp::~LogClientTemp
      0.0   0.0% 100.0%      0.0   0.0% MAuthReply::decode_payload
      0.0   0.0% 100.0%      0.0   0.0% MCommand::decode_payload
      0.0   0.0% 100.0%      0.0   0.0% MCommand::print
      0.0   0.0% 100.0%      0.0   0.0% MMgrMap::decode_payload
      0.0   0.0% 100.0%      0.0   0.0% MOSDFailure::print
      0.0   0.0% 100.0%      0.1   0.0% MOSDMap::decode_payload
      0.0   0.0% 100.0%    203.1   0.8% MOSDPGNotify::decode_payload
      0.0   0.0% 100.0%      0.0   0.0% MOSDPGNotify::print
      0.0   0.0% 100.0%      0.0   0.0% MOSDPing::encode_payload
      0.0   0.0% 100.0%      0.0   0.0% Message::encode
      0.0   0.0% 100.0%      0.0   0.0% MgrClient::handle_mgr_map
      0.0   0.0% 100.0%      0.0   0.0% MgrClient::ms_dispatch
      0.0   0.0% 100.0%      0.0   0.0% MgrMap::decode
      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_rotating
      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_tickets
      0.0   0.0% 100.0%      0.0   0.0% MonClient::_finish_hunting
      0.0   0.0% 100.0%      0.8   0.0% MonClient::_reopen_session@aeab80
      0.0   0.0% 100.0%      0.6   0.0% MonClient::_reopen_session@af2ba0
      0.0   0.0% 100.0%      9.5   0.0% MonClient::handle_auth
      0.0   0.0% 100.0%      9.6   0.0% MonClient::ms_dispatch
      0.0   0.0% 100.0%      0.2   0.0% MonClient::send_log
      0.0   0.0% 100.0%      0.6   0.0% MonClient::tick
      0.0   0.0% 100.0%    283.5   1.0% NetworkStack::get_worker
      0.0   0.0% 100.0%      0.0   0.0% OSD::CommandWQ::_process
      0.0   0.0% 100.0%  25862.5  95.7% OSD::PeeringWQ::_process
      0.0   0.0% 100.0%      0.0   0.0% OSD::Session::Session
      0.0   0.0% 100.0%    686.3   2.5% OSD::T_Heartbeat::entry
      0.0   0.0% 100.0%      2.5   0.0% OSD::_committed_osd_maps
      0.0   0.0% 100.0%  25804.6  95.5% OSD::advance_pg
      0.0   0.0% 100.0%      0.3   0.0% OSD::check_ops_in_flight
      0.0   0.0% 100.0%      0.0   0.0% OSD::check_osdmap_features
      0.0   0.0% 100.0%      2.5   0.0% OSD::consume_map
      0.0   0.0% 100.0%     57.8   0.2% OSD::dispatch_context
      0.0   0.0% 100.0%      0.5   0.0% OSD::dispatch_op
      0.0   0.0% 100.0%      0.0   0.0% OSD::do_command
      0.0   0.0% 100.0%      0.2   0.0% OSD::do_waiters
      0.0   0.0% 100.0%      0.0   0.0% OSD::get_osdmap_pobject_name
      0.0   0.0% 100.0%      0.1   0.0% OSD::handle_osd_ping
      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_pg_peering_evt
      0.0   0.0% 100.0%     37.2   0.1% OSD::heartbeat_check
      0.0   0.0% 100.0%      0.1   0.0% OSD::heartbeat_dispatch
      0.0   0.0% 100.0%    686.3   2.5% OSD::heartbeat_entry
      0.0   0.0% 100.0%      1.1   0.0% OSD::heartbeat_reset
      0.0   0.0% 100.0%    148.7   0.6% OSD::ms_dispatch
      0.0   0.0% 100.0%      0.8   0.0% OSD::ms_handle_connect
      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_refused
      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_reset
      0.0   0.0% 100.0%  25862.5  95.7% OSD::process_peering_events
      0.0   0.0% 100.0%      0.1   0.0% OSD::require_same_or_newer_map
      0.0   0.0% 100.0%      0.0   0.0% OSD::write_superblock
      0.0   0.0% 100.0%      0.0   0.0% OSDCap::parse
      0.0   0.0% 100.0%      0.0   0.0% OSDMap::Incremental::decode
      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b85440
      0.0   0.0% 100.0%    110.8   0.4% OSDMap::encode
      0.0   0.0% 100.0%      0.5   0.0% OSDMap::post_decode
      0.0   0.0% 100.0%      0.1   0.0% OSDService::_get_map_bl
      0.0   0.0% 100.0%      0.0   0.0% OSDService::check_nearfull_warning
      0.0   0.0% 100.0%      0.1   0.0% OSDService::clear_map_bl_cache_pins
      0.0   0.0% 100.0%      1.1   0.0% OSDService::get_con_osd_hb
      0.0   0.0% 100.0%      1.3   0.0% OSDService::get_inc_map_bl
      0.0   0.0% 100.0%      1.3   0.0% OSDService::pin_map_bl
      0.0   0.0% 100.0%      0.0   0.0% OSDService::pin_map_inc_bl
      0.0   0.0% 100.0%      0.0   0.0% OSDService::publish_superblock
      0.0   0.0% 100.0%      0.3   0.0% OSDService::queue_for_peering
      0.0   0.0% 100.0%     27.2   0.1% OSDService::send_incremental_map
      0.0   0.0% 100.0%     27.2   0.1% OSDService::share_map_peer
      0.0   0.0% 100.0%      0.0   0.0% OSDService::update_osd_stat
      0.0   0.0% 100.0%      0.0   0.0% ObjectStore::Transaction::_get_coll_id
      0.0   0.0% 100.0%      0.0   0.0% ObjectStore::Transaction::_get_next_op
      0.0   0.0% 100.0%      0.2   0.0% ObjectStore::Transaction::write
      0.0   0.0% 100.0%      0.0   0.0% ObjectStore::queue_transaction
      0.0   0.0% 100.0%      0.0   0.0% Objecter::_maybe_request_map
      0.0   0.0% 100.0%      0.1   0.0% Objecter::handle_osd_map
      0.0   0.0% 100.0%      0.1   0.0% OpHistory::insert
      0.0   0.0% 100.0%      0.0   0.0% OpRequest::OpRequest
      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_flag_point
      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_started
      0.0   0.0% 100.0%      0.1   0.0% OpTracker::RemoveOnDelete::operator
      0.0   0.0% 100.0%      0.1   0.0% OpTracker::_mark_event
      0.0   0.0% 100.0%      0.0   0.0% OpTracker::get_age_ms_histogram
      0.0   0.0% 100.0%  25802.4  95.4% PG::RecoveryState::Stray::react
      0.0   0.0% 100.0%      0.0   0.0% PG::_prepare_write_info
      0.0   0.0% 100.0%  25802.4  95.4% PG::handle_activate_map
      0.0   0.0% 100.0%      1.6   0.0% PG::handle_advance_map
      0.0   0.0% 100.0%      0.0   0.0% PG::prepare_write_info
      0.0   0.0% 100.0%      0.0   0.0% PG::write_if_dirty
      0.0   0.0% 100.0%      1.6   0.0% PGPool::update
      0.0   0.0% 100.0%      0.0   0.0% PK11_FreeSymKey
      0.0   0.0% 100.0%      0.0   0.0% PK11_GetIVLength
      0.0   0.0% 100.0%      0.0   0.0% PK11_ImportSymKey
      0.0   0.0% 100.0%      0.0   0.0% PrebufferedStreambuf::overflow
      0.0   0.0% 100.0%      1.8   0.0% Processor::accept
      0.0   0.0% 100.0%      0.0   0.0% SECITEM_CopyItem_Util
      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_after
      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_at
      0.0   0.0% 100.0%     38.4   0.1% SafeTimer::timer_thread
      0.0   0.0% 100.0%     38.4   0.1% SafeTimerThread::entry
      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::WorkThread::entry
      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::worker
      0.0   0.0% 100.0%  27023.8 100.0% __clone
      0.0   0.0% 100.0%      0.0   0.0% boost::detail::function::void_function_obj_invoker2::invoke
      0.0   0.0% 100.0%      0.0   0.0% boost::proto::detail::default_assign::impl::operator
      0.0   0.0% 100.0%      0.0   0.0% boost::spirit::classic::impl::concrete_parser::do_parse_virtual
      0.0   0.0% 100.0%      0.0   0.0% boost::spirit::qi::action::parse
      0.0   0.0% 100.0%      0.3   0.0% boost::statechart::event_base::intrusive_from_this
      0.0   0.0% 100.0%  25802.4  95.4% boost::statechart::simple_state::react_impl
      0.0   0.0% 100.0%  25802.4  95.4% boost::statechart::state_machine::send_event
      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@48b620
      0.0   0.0% 100.0%      0.4   0.0% ceph::buffer::list::contiguous_appender::contiguous_appender
      0.0   0.0% 100.0%      2.4   0.0% ceph::buffer::list::crc32c
      0.0   0.0% 100.0%      0.1   0.0% ceph::buffer::list::iterator_impl::copy
      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::iterator_impl::copy_deep
      0.0   0.0% 100.0%      5.7   0.0% ceph::buffer::list::iterator_impl::copy_shallow
      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::ptr::ptr
      0.0   0.0% 100.0%      0.0   0.0% ceph_heap_profiler_handle_command
      0.0   0.0% 100.0%      0.0   0.0% ceph_os_fremovexattr
      0.0   0.0% 100.0%      0.0   0.0% cephx_verify_authorizer
      0.0   0.0% 100.0%      0.0   0.0% cmdmap_from_json
      0.0   0.0% 100.0%      2.2   0.0% crush_hash_name
      0.0   0.0% 100.0%      0.1   0.0% decode
      0.0   0.0% 100.0%     20.1   0.1% entity_addr_t::encode
      0.0   0.0% 100.0%      0.0   0.0% get_str_vec
      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15110
      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15b90
      0.0   0.0% 100.0%      0.0   0.0% json_spirit::Semantic_actions::new_name
      0.0   0.0% 100.0%      0.0   0.0% json_spirit::Semantic_actions::new_str
      0.0   0.0% 100.0%      1.1   0.0% json_spirit::Value_impl::get_uint64
      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str
      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str_
      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read
      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range
      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range_or_throw
      0.0   0.0% 100.0%      0.0   0.0% json_spirit::substitute_esc_chars
      0.0   0.0% 100.0%      0.0   0.0% operator<<@a91e90
      0.0   0.0% 100.0%      3.5   0.0% osd_info_t::encode
      0.0   0.0% 100.0%      4.4   0.0% osd_xinfo_t::encode
      0.0   0.0% 100.0%      0.1   0.0% pg_info_t::decode
      0.0   0.0% 100.0%      0.0   0.0% pg_info_t::operator=
      0.0   0.0% 100.0%      9.9   0.0% pg_info_t::pg_info_t
      0.0   0.0% 100.0%     87.5   0.3% pg_interval_t::decode
      0.0   0.0% 100.0%      1.0   0.0% pg_pool_t::decode
      0.0   0.0% 100.0%      1.8   0.0% pg_pool_t::encode
      0.0   0.0% 100.0%      0.0   0.0% pg_stat_t::decode
      0.0   0.0% 100.0%  27032.1 100.0% start_thread
      0.0   0.0% 100.0%      1.3   0.0% std::_Rb_tree::operator=
      0.0   0.0% 100.0%      0.1   0.0% std::_Sp_counted_base::_M_release
      0.0   0.0% 100.0%      0.0   0.0% std::__detail::_Map_base::operator[]
      0.0   0.0% 100.0%      0.0   0.0% std::__ostream_insert
      0.0   0.0% 100.0%      0.1   0.0% std::basic_streambuf::xsputn
      0.0   0.0% 100.0%      0.1   0.0% std::basic_string::basic_string
      0.0   0.0% 100.0%      0.0   0.0% std::basic_stringbuf::overflow
      0.0   0.0% 100.0%      1.0   0.0% std::basic_stringbuf::str
      0.0   0.0% 100.0%     71.0   0.3% std::enable_if::type encode
      0.0   0.0% 100.0%      0.1   0.0% std::getline
      0.0   0.0% 100.0%      0.0   0.0% std::num_put::_M_insert_int
      0.0   0.0% 100.0%      0.0   0.0% std::num_put::do_put
      0.0   0.0% 100.0%      0.0   0.0% std::operator<<
      0.0   0.0% 100.0%      0.0   0.0% std::ostream::_M_insert
      0.0   0.0% 100.0%      1.2   0.0% std::string::_Rep::_M_clone
      0.0   0.0% 100.0%      1.2   0.0% std::string::_S_construct
      0.0   0.0% 100.0%      1.2   0.0% std::string::append
      0.0   0.0% 100.0%      1.2   0.0% std::string::reserve
      0.0   0.0% 100.0%    283.5   1.0% std::this_thread::__sleep_for
      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c12db0
      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c14a80
      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c15450
      0.0   0.0% 100.0%     20.1   0.1% void encode
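For reference, the profile was taken with the tcmalloc heap profiler
built into the OSD and rendered with pprof, roughly as follows (osd.4
is only an example id, and paths depend on the install):

   ceph tell osd.4 heap start_profiler
   # ...wait while the memory grows...
   ceph tell osd.4 heap dump
   ceph tell osd.4 heap stop_profiler

   # the dump lands in the OSD log directory; render it as text
   # (the pprof binary is called google-pprof on some distros)
   pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.4.profile.0001.heap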


I also generated the PDF with all the charts, but I am not sure how to
share it with you guys.
Any idea what is happening here?

thanks
ali




* Re: High memory usage kills OSD while peering
  2017-08-17 18:51   ` Linux Chips
@ 2017-08-19 16:38     ` Mustafa Muhammad
  2017-08-22 22:33       ` Sage Weil
       [not found]     ` <CAGtbiz1eHTiaO4pWu4sU97E8N+=DthTXjbY_Ga9CONW862y2XQ@mail.gmail.com>
  2017-08-21 10:57     ` Linux Chips
  2 siblings, 1 reply; 27+ messages in thread
From: Mustafa Muhammad @ 2017-08-19 16:38 UTC (permalink / raw)
  To: ceph-devel

Hi all,
It looks like the memory is consumed in
"PG::RecoveryState::RecoveryMachine::send_notify". Is this related to
the messenger? Can we get lower memory usage even if this means slower
peering (or delayed recovery)?

Thanks in advance

Mustafa Muhammad


On Thu, Aug 17, 2017 at 9:51 PM, Linux Chips <linux.chips@gmail.com> wrote:
>
>
> On 08/17/2017 08:53 PM, Gregory Farnum wrote:
>>
>> On Thu, Aug 17, 2017 at 7:13 AM, Linux Chips <linux.chips@gmail.com>
>> wrote:
>>>
>>> Hello everybody,
>>> I have Kraken cluster with 660 OSD, currently it is down due to not
>>> being able to complete peering, OSDs start consuming lots of memory
>>> draining the system and killing the node, so I set a limit on the OSD
>>> service (on some OSDs 28G and others as high as 35G), so they get
>>> killed before taking down the whole node.
>>> Now I still can't peer, one OSD entering the cluster (with about 300
>>> already up) makes memory usage of most other OSDs so high (15G+, some as
>>> much as 30G) and
>>> sometimes kills them when they reach the service limit. which cause a
>>> spiral
>>> load and causing all the OSDs to consume all the available.
>>>
>>> I found this thread with similar symptoms:
>>>
>>>
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html
>>>
>>> with a request for stack trace, I have a 14G core dump, we generated it
>>> by
>>> running the osd from the terminal, enabling the core dumps, and setting
>>> ulimits to 15G. what kind of a trace would be useful? all thread?! any
>>> better way to debug this?
>>>
>>> What can I do do make it work, is this memory allocation normal?
>>>
>>> some info about the cluster:
>>> 41 hdd nodes with 12 x 4TB osd each, 5 of the nodes have 8TB disks. 324
>>> GB
>>> RAM and dula socket intel xeon.
>>> 7 nodes with 400GB x 24 ssd and 256GB RAM, and dual socket cpu.
>>> 3 monitors
>>>
>>> all dual 10GB ethernet, except for the monitor with dual 1GB ethers.
>>>
>>> all nodes running centos 7.2
>>> it is an old cluster that was upgraded continuously for the past 3 years.
>>> the cluster was on jewel when the issue happened due to some accidental
>>> OSD
>>> map changes, causing a heavy recovery operations on the cluster. then we
>>> upgraded to kraken in the hope of less memory foot prints.
>>>
>>> any advice on how to proceed?
>>
>> It's not normal but if something really bad happened to your cluster,
>> it's been known to occur. You should go through the troubleshooting
>> guides at docs.ceph.com, but the general strategy is to set
>> nodown/noout/etc flags, undo whatever horrible thing you tried to make
>> the map do, and then turn all the OSDs back on.
>> -Greg
>
>
> Hi,
> we have been trying this for the past week, it keeps consuming the RAM.
> we got the map back to the original places. marked all the flags, started
> all the OSDs. then "ceph osd unset noup", wait 5 min, and all OSDs are
> killed by the oom.
> we tried one node at a time, let it finish recovering, and start the next.
> we got to a point when we started the next node, every thing got killed.
> we tried one OSD at a time, same result. one OSD up, ~40 killed by oom, then
> it is a snow ball from here until all of the active OSDs get kiiled.
>
> I think all this up/down that we generated has increased the recovery too
> much. btw, we stopped all clients. and also we have some not so friendly
> erasure pools. some OSDs now report loading as much as 800 pg, while we
> originally had about 300-400 (I know too much, but we were trying to fix it
> and.... well we could not).
>
> we did a memory profiling on one of the OSDs.
> here is the results
>
>
>  12878.6  47.6%  47.6%  12878.6  47.6% std::_Rb_tree::_M_create_node
>  12867.6  47.6%  95.2%  25746.2  95.2% std::_Rb_tree::_M_copy
>    532.4   2.0%  97.2%    686.3   2.5% OSD::heartbeat
>    122.8   0.5%  97.7%    122.8   0.5% std::_Rb_tree::_M_emplace_hint_unique
>    121.9   0.5%  98.1%    171.1   0.6% AsyncConnection::send_message
>    104.2   0.4%  98.5%    104.2   0.4% ceph::buffer::list::append@c4a770
>     99.7   0.4%  98.9%     99.7   0.4% std::vector::_M_default_append
>     99.6   0.4%  99.2%     99.6   0.4% ceph::logging::Log::create_entry
>     72.6   0.3%  99.5%     72.6   0.3% ceph::buffer::create_aligned
>     52.4   0.2%  99.7%     52.5   0.2% std::vector::_M_emplace_back_aux
>     23.9   0.1%  99.8%     57.8   0.2% OSD::do_notifies
>     17.0   0.1%  99.8%     23.1   0.1% OSDService::build_incremental_map_msg
>      9.8   0.0%  99.9%    222.5   0.8% std::enable_if::type decode
>      6.2   0.0%  99.9%      6.3   0.0% std::map::operator[]
>      5.5   0.0%  99.9%      5.5   0.0% std::vector::vector
>      3.5   0.0%  99.9%      3.5   0.0% EventCenter::create_time_event
>      2.5   0.0%  99.9%      2.5   0.0% AsyncConnection::AsyncConnection
>      2.4   0.0% 100.0%      2.4   0.0% std::string::_Rep::_S_create
>      1.5   0.0% 100.0%      1.5   0.0% std::_Rb_tree::_M_insert_unique
>      1.4   0.0% 100.0%      1.4   0.0% std::list::operator=
>      1.3   0.0% 100.0%      1.3   0.0% ceph::buffer::list::list
>      0.9   0.0% 100.0%    204.1   0.8% decode_message
>      0.8   0.0% 100.0%      0.8   0.0% OSD::send_failures
>      0.7   0.0% 100.0%      0.9   0.0% void decode
>      0.6   0.0% 100.0%      0.6   0.0% std::_Rb_tree::_M_insert_equal
>      0.6   0.0% 100.0%      2.5   0.0% PG::queue_null
>      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::create_connect
>      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::add_accept
>      0.5   0.0% 100.0%      0.5   0.0% boost::statechart::event::clone
>      0.4   0.0% 100.0%      0.4   0.0% PG::queue_peering_event
>      0.3   0.0% 100.0%      0.3   0.0% OSD::PeeringWQ::_enqueue
>      0.3   0.0% 100.0%    148.6   0.5% OSD::_dispatch
>      0.1   0.0% 100.0%    147.9   0.5% OSD::handle_osd_map
>      0.1   0.0% 100.0%      0.1   0.0% std::deque::_M_push_back_aux
>      0.1   0.0% 100.0%      0.2   0.0% SharedLRU::add
>      0.1   0.0% 100.0%      0.1   0.0% OSD::PeeringWQ::_dequeue
>      0.1   0.0% 100.0%      0.1   0.0% ceph::buffer::list::append@c4a9b0
>      0.1   0.0% 100.0%      0.2   0.0% DispatchQueue::enqueue
>      0.1   0.0% 100.0%    283.5   1.0% EventCenter::process_events
>      0.1   0.0% 100.0%      0.1   0.0% HitSet::Params::create_impl
>      0.1   0.0% 100.0%      0.1   0.0% SimpleLRU::clear_pinned
>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_
>      0.0   0.0% 100.0%      0.2   0.0% TrackedOp::mark_event
>      0.0   0.0% 100.0%      0.0   0.0% OSD::create_context
>      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_allocate_node
>      0.0   0.0% 100.0%      0.0   0.0% OSDMap::OSDMap
>      0.0   0.0% 100.0%    281.6   1.0% AsyncConnection::process
>      0.0   0.0% 100.0%  25802.4  95.4%
> PG::RecoveryState::RecoveryMachine::send_notify
>      0.0   0.0% 100.0%      0.0   0.0% SharedLRU::lru_add
>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_unique_
>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::unregister_inflight_op
>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_verify_authorizer
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::_add_map
>      0.0   0.0% 100.0%      0.1   0.0% OSD::wait_for_new_map
>      0.0   0.0% 100.0%      0.5   0.0% OSD::handle_pg_notify
>      0.0   0.0% 100.0%      0.0   0.0% std::__shared_count::__shared_count
>      0.0   0.0% 100.0%      0.0   0.0% std::__shared_ptr::reset
>      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b84080
>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_emplace_unique
>      0.0   0.0% 100.0%      0.0   0.0% std::vector::operator=
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_renew_subs
>      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_emplace
>      0.0   0.0% 100.0%      0.0   0.0% PORT_Alloc_Util
>      0.0   0.0% 100.0%      0.0   0.0% CryptoAES::get_key_handler
>      0.0   0.0% 100.0%      0.0   0.0% get_auth_session_handler
>      0.0   0.0% 100.0%      0.0   0.0% PosixWorker::connect
>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a440
>      0.0   0.0% 100.0%      0.0   0.0% std::vector::_M_fill_insert
>      0.0   0.0% 100.0%      4.8   0.0% AsyncConnection::fault
>      0.0   0.0% 100.0%      0.0   0.0% OSD::send_pg_stats
>      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::accept_conn
>      0.0   0.0% 100.0%      0.0   0.0% PosixServerSocketImpl::accept
>      0.0   0.0% 100.0%      9.3   0.0% AsyncConnection::_process_connection
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::lfn_open
>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a350
>      0.0   0.0% 100.0%      0.0   0.0% crush_create
>      0.0   0.0% 100.0%      0.1   0.0% MgrClient::send_report
>      0.0   0.0% 100.0%      0.0   0.0% WBThrottle::queue_wb
>      0.0   0.0% 100.0%      0.2   0.0% LogClient::_get_mon_log_message
>      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::_set_secret
>      0.0   0.0% 100.0%      0.0   0.0% std::_Deque_base::_M_initialize_map
>      0.0   0.0% 100.0%      0.1   0.0%
> ThreadPool::BatchWorkQueue::_void_dequeue
>      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@ba6a50
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::schedule_tick
>      0.0   0.0% 100.0%      0.1   0.0% OSD::tick
>      0.0   0.0% 100.0%     37.6   0.1% OSD::tick_without_osd_lock
>      0.0   0.0% 100.0%      0.0   0.0%
> boost::spirit::classic::impl::get_definition
>      0.0   0.0% 100.0%      9.4   0.0% MonClient::_send_mon_message
>      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_refused
>      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_command
>      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_accept
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_connect
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_stop
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::accept
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::handle_connect_msg
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::mark_down
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::prepare_send_message
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_bulk
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_until
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::send_keepalive
>      0.0   0.0% 100.0%      3.3   0.0% AsyncConnection::wakeup_from
>      0.0   0.0% 100.0%      1.8   0.0% AsyncMessenger::get_connection
>      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::reap_dead
>      0.0   0.0% 100.0%      2.5   0.0% C_OnMapCommit::finish
>      0.0   0.0% 100.0%      0.0   0.0%
> CephXTicketHandler::verify_service_ticket_reply
>      0.0   0.0% 100.0%      0.0   0.0%
> CephXTicketManager::verify_service_ticket_reply
>      0.0   0.0% 100.0%      0.0   0.0%
> CephxAuthorizeHandler::verify_authorizer
>      0.0   0.0% 100.0%      0.0   0.0% CephxClientHandler::handle_response
>      0.0   0.0% 100.0%     40.9   0.2% Context::complete
>      0.0   0.0% 100.0%      4.8   0.0% CrushWrapper::encode
>      0.0   0.0% 100.0%      0.0   0.0% CryptoAESKeyHandler::decrypt
>      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::decode
>      0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::DispatchThread::entry
>      0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::entry
>      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::fast_dispatch
>      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::pre_dispatch
>      0.0   0.0% 100.0%      0.0   0.0% EntityName::set
>      0.0   0.0% 100.0%      0.0   0.0% EpollDriver::event_wait
>      0.0   0.0% 100.0%      3.0   0.0% EventCenter::dispatch_event_external
>      0.0   0.0% 100.0%      3.3   0.0% EventCenter::process_time_events
>      0.0   0.0% 100.0%      3.0   0.0% EventCenter::wakeup
>      0.0   0.0% 100.0%      0.0   0.0% FileJournal::prepare_entry
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_op
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transaction
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transactions
>      0.0   0.0% 100.0%      0.0   0.0% FileStore::_journaled_ahead
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_write
>      0.0   0.0% 100.0%      0.0   0.0% FileStore::queue_transactions
>      0.0   0.0% 100.0%      2.6   0.0% Finisher::finisher_thread_entry
>      0.0   0.0% 100.0%      0.1   0.0% FunctionContext::finish
>      0.0   0.0% 100.0%      0.1   0.0% HitSet::Params::decode
>      0.0   0.0% 100.0%      0.2   0.0% LogChannel::do_log@a90a00
>      0.0   0.0% 100.0%      0.3   0.0% LogChannel::do_log@a91030
>      0.0   0.0% 100.0%      0.2   0.0% LogClient::get_mon_log_message
>      0.0   0.0% 100.0%      0.0   0.0% LogClient::handle_log_ack
>      0.0   0.0% 100.0%      0.1   0.0% LogClient::queue
>      0.0   0.0% 100.0%      0.3   0.0% LogClientTemp::~LogClientTemp
>      0.0   0.0% 100.0%      0.0   0.0% MAuthReply::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MCommand::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MCommand::print
>      0.0   0.0% 100.0%      0.0   0.0% MMgrMap::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MOSDFailure::print
>      0.0   0.0% 100.0%      0.1   0.0% MOSDMap::decode_payload
>      0.0   0.0% 100.0%    203.1   0.8% MOSDPGNotify::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MOSDPGNotify::print
>      0.0   0.0% 100.0%      0.0   0.0% MOSDPing::encode_payload
>      0.0   0.0% 100.0%      0.0   0.0% Message::encode
>      0.0   0.0% 100.0%      0.0   0.0% MgrClient::handle_mgr_map
>      0.0   0.0% 100.0%      0.0   0.0% MgrClient::ms_dispatch
>      0.0   0.0% 100.0%      0.0   0.0% MgrMap::decode
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_rotating
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_tickets
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_finish_hunting
>      0.0   0.0% 100.0%      0.8   0.0% MonClient::_reopen_session@aeab80
>      0.0   0.0% 100.0%      0.6   0.0% MonClient::_reopen_session@af2ba0
>      0.0   0.0% 100.0%      9.5   0.0% MonClient::handle_auth
>      0.0   0.0% 100.0%      9.6   0.0% MonClient::ms_dispatch
>      0.0   0.0% 100.0%      0.2   0.0% MonClient::send_log
>      0.0   0.0% 100.0%      0.6   0.0% MonClient::tick
>      0.0   0.0% 100.0%    283.5   1.0% NetworkStack::get_worker
>      0.0   0.0% 100.0%      0.0   0.0% OSD::CommandWQ::_process
>      0.0   0.0% 100.0%  25862.5  95.7% OSD::PeeringWQ::_process
>      0.0   0.0% 100.0%      0.0   0.0% OSD::Session::Session
>      0.0   0.0% 100.0%    686.3   2.5% OSD::T_Heartbeat::entry
>      0.0   0.0% 100.0%      2.5   0.0% OSD::_committed_osd_maps
>      0.0   0.0% 100.0%  25804.6  95.5% OSD::advance_pg
>      0.0   0.0% 100.0%      0.3   0.0% OSD::check_ops_in_flight
>      0.0   0.0% 100.0%      0.0   0.0% OSD::check_osdmap_features
>      0.0   0.0% 100.0%      2.5   0.0% OSD::consume_map
>      0.0   0.0% 100.0%     57.8   0.2% OSD::dispatch_context
>      0.0   0.0% 100.0%      0.5   0.0% OSD::dispatch_op
>      0.0   0.0% 100.0%      0.0   0.0% OSD::do_command
>      0.0   0.0% 100.0%      0.2   0.0% OSD::do_waiters
>      0.0   0.0% 100.0%      0.0   0.0% OSD::get_osdmap_pobject_name
>      0.0   0.0% 100.0%      0.1   0.0% OSD::handle_osd_ping
>      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_pg_peering_evt
>      0.0   0.0% 100.0%     37.2   0.1% OSD::heartbeat_check
>      0.0   0.0% 100.0%      0.1   0.0% OSD::heartbeat_dispatch
>      0.0   0.0% 100.0%    686.3   2.5% OSD::heartbeat_entry
>      0.0   0.0% 100.0%      1.1   0.0% OSD::heartbeat_reset
>      0.0   0.0% 100.0%    148.7   0.6% OSD::ms_dispatch
>      0.0   0.0% 100.0%      0.8   0.0% OSD::ms_handle_connect
>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_refused
>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_reset
>      0.0   0.0% 100.0%  25862.5  95.7% OSD::process_peering_events
>      0.0   0.0% 100.0%      0.1   0.0% OSD::require_same_or_newer_map
>      0.0   0.0% 100.0%      0.0   0.0% OSD::write_superblock
>      0.0   0.0% 100.0%      0.0   0.0% OSDCap::parse
>      0.0   0.0% 100.0%      0.0   0.0% OSDMap::Incremental::decode
>      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b85440
>      0.0   0.0% 100.0%    110.8   0.4% OSDMap::encode
>      0.0   0.0% 100.0%      0.5   0.0% OSDMap::post_decode
>      0.0   0.0% 100.0%      0.1   0.0% OSDService::_get_map_bl
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::check_nearfull_warning
>      0.0   0.0% 100.0%      0.1   0.0% OSDService::clear_map_bl_cache_pins
>      0.0   0.0% 100.0%      1.1   0.0% OSDService::get_con_osd_hb
>      0.0   0.0% 100.0%      1.3   0.0% OSDService::get_inc_map_bl
>      0.0   0.0% 100.0%      1.3   0.0% OSDService::pin_map_bl
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::pin_map_inc_bl
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::publish_superblock
>      0.0   0.0% 100.0%      0.3   0.0% OSDService::queue_for_peering
>      0.0   0.0% 100.0%     27.2   0.1% OSDService::send_incremental_map
>      0.0   0.0% 100.0%     27.2   0.1% OSDService::share_map_peer
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::update_osd_stat
>      0.0   0.0% 100.0%      0.0   0.0%
> ObjectStore::Transaction::_get_coll_id
>      0.0   0.0% 100.0%      0.0   0.0%
> ObjectStore::Transaction::_get_next_op
>      0.0   0.0% 100.0%      0.2   0.0% ObjectStore::Transaction::write
>      0.0   0.0% 100.0%      0.0   0.0% ObjectStore::queue_transaction
>      0.0   0.0% 100.0%      0.0   0.0% Objecter::_maybe_request_map
>      0.0   0.0% 100.0%      0.1   0.0% Objecter::handle_osd_map
>      0.0   0.0% 100.0%      0.1   0.0% OpHistory::insert
>      0.0   0.0% 100.0%      0.0   0.0% OpRequest::OpRequest
>      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_flag_point
>      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_started
>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::RemoveOnDelete::operator
>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::_mark_event
>      0.0   0.0% 100.0%      0.0   0.0% OpTracker::get_age_ms_histogram
>      0.0   0.0% 100.0%  25802.4  95.4% PG::RecoveryState::Stray::react
>      0.0   0.0% 100.0%      0.0   0.0% PG::_prepare_write_info
>      0.0   0.0% 100.0%  25802.4  95.4% PG::handle_activate_map
>      0.0   0.0% 100.0%      1.6   0.0% PG::handle_advance_map
>      0.0   0.0% 100.0%      0.0   0.0% PG::prepare_write_info
>      0.0   0.0% 100.0%      0.0   0.0% PG::write_if_dirty
>      0.0   0.0% 100.0%      1.6   0.0% PGPool::update
>      0.0   0.0% 100.0%      0.0   0.0% PK11_FreeSymKey
>      0.0   0.0% 100.0%      0.0   0.0% PK11_GetIVLength
>      0.0   0.0% 100.0%      0.0   0.0% PK11_ImportSymKey
>      0.0   0.0% 100.0%      0.0   0.0% PrebufferedStreambuf::overflow
>      0.0   0.0% 100.0%      1.8   0.0% Processor::accept
>      0.0   0.0% 100.0%      0.0   0.0% SECITEM_CopyItem_Util
>      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_after
>      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_at
>      0.0   0.0% 100.0%     38.4   0.1% SafeTimer::timer_thread
>      0.0   0.0% 100.0%     38.4   0.1% SafeTimerThread::entry
>      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::WorkThread::entry
>      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::worker
>      0.0   0.0% 100.0%  27023.8 100.0% __clone
>      0.0   0.0% 100.0%      0.0   0.0%
> boost::detail::function::void_function_obj_invoker2::invoke
>      0.0   0.0% 100.0%      0.0   0.0%
> boost::proto::detail::default_assign::impl::operator
>      0.0   0.0% 100.0%      0.0   0.0%
> boost::spirit::classic::impl::concrete_parser::do_parse_virtual
>      0.0   0.0% 100.0%      0.0   0.0% boost::spirit::qi::action::parse
>      0.0   0.0% 100.0%      0.3   0.0%
> boost::statechart::event_base::intrusive_from_this
>      0.0   0.0% 100.0%  25802.4  95.4%
> boost::statechart::simple_state::react_impl
>      0.0   0.0% 100.0%  25802.4  95.4%
> boost::statechart::state_machine::send_event
>      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@48b620
>      0.0   0.0% 100.0%      0.4   0.0%
> ceph::buffer::list::contiguous_appender::contiguous_appender
>      0.0   0.0% 100.0%      2.4   0.0% ceph::buffer::list::crc32c
>      0.0   0.0% 100.0%      0.1   0.0%
> ceph::buffer::list::iterator_impl::copy
>      0.0   0.0% 100.0%      0.0   0.0%
> ceph::buffer::list::iterator_impl::copy_deep
>      0.0   0.0% 100.0%      5.7   0.0%
> ceph::buffer::list::iterator_impl::copy_shallow
>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::ptr::ptr
>      0.0   0.0% 100.0%      0.0   0.0% ceph_heap_profiler_handle_command
>      0.0   0.0% 100.0%      0.0   0.0% ceph_os_fremovexattr
>      0.0   0.0% 100.0%      0.0   0.0% cephx_verify_authorizer
>      0.0   0.0% 100.0%      0.0   0.0% cmdmap_from_json
>      0.0   0.0% 100.0%      2.2   0.0% crush_hash_name
>      0.0   0.0% 100.0%      0.1   0.0% decode
>      0.0   0.0% 100.0%     20.1   0.1% entity_addr_t::encode
>      0.0   0.0% 100.0%      0.0   0.0% get_str_vec
>      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15110
>      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15b90
>      0.0   0.0% 100.0%      0.0   0.0%
> json_spirit::Semantic_actions::new_name
>      0.0   0.0% 100.0%      0.0   0.0%
> json_spirit::Semantic_actions::new_str
>      0.0   0.0% 100.0%      1.1   0.0% json_spirit::Value_impl::get_uint64
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str_
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range_or_throw
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::substitute_esc_chars
>      0.0   0.0% 100.0%      0.0   0.0% operator<<@a91e90
>      0.0   0.0% 100.0%      3.5   0.0% osd_info_t::encode
>      0.0   0.0% 100.0%      4.4   0.0% osd_xinfo_t::encode
>      0.0   0.0% 100.0%      0.1   0.0% pg_info_t::decode
>      0.0   0.0% 100.0%      0.0   0.0% pg_info_t::operator=
>      0.0   0.0% 100.0%      9.9   0.0% pg_info_t::pg_info_t
>      0.0   0.0% 100.0%     87.5   0.3% pg_interval_t::decode
>      0.0   0.0% 100.0%      1.0   0.0% pg_pool_t::decode
>      0.0   0.0% 100.0%      1.8   0.0% pg_pool_t::encode
>      0.0   0.0% 100.0%      0.0   0.0% pg_stat_t::decode
>      0.0   0.0% 100.0%  27032.1 100.0% start_thread
>      0.0   0.0% 100.0%      1.3   0.0% std::_Rb_tree::operator=
>      0.0   0.0% 100.0%      0.1   0.0% std::_Sp_counted_base::_M_release
>      0.0   0.0% 100.0%      0.0   0.0% std::__detail::_Map_base::operator[]
>      0.0   0.0% 100.0%      0.0   0.0% std::__ostream_insert
>      0.0   0.0% 100.0%      0.1   0.0% std::basic_streambuf::xsputn
>      0.0   0.0% 100.0%      0.1   0.0% std::basic_string::basic_string
>      0.0   0.0% 100.0%      0.0   0.0% std::basic_stringbuf::overflow
>      0.0   0.0% 100.0%      1.0   0.0% std::basic_stringbuf::str
>      0.0   0.0% 100.0%     71.0   0.3% std::enable_if::type encode
>      0.0   0.0% 100.0%      0.1   0.0% std::getline
>      0.0   0.0% 100.0%      0.0   0.0% std::num_put::_M_insert_int
>      0.0   0.0% 100.0%      0.0   0.0% std::num_put::do_put
>      0.0   0.0% 100.0%      0.0   0.0% std::operator<<
>      0.0   0.0% 100.0%      0.0   0.0% std::ostream::_M_insert
>      0.0   0.0% 100.0%      1.2   0.0% std::string::_Rep::_M_clone
>      0.0   0.0% 100.0%      1.2   0.0% std::string::_S_construct
>      0.0   0.0% 100.0%      1.2   0.0% std::string::append
>      0.0   0.0% 100.0%      1.2   0.0% std::string::reserve
>      0.0   0.0% 100.0%    283.5   1.0% std::this_thread::__sleep_for
>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c12db0
>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c14a80
>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c15450
>      0.0   0.0% 100.0%     20.1   0.1% void encode
>
>
> I also generated the PDf with all the charts, but not sure how to share it
> with you guys.
> any Idea what is happening here ?
>
> thanks
> ali
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: High memory usage kills OSD while peering
       [not found]     ` <CAGtbiz1eHTiaO4pWu4sU97E8N+=DthTXjbY_Ga9CONW862y2XQ@mail.gmail.com>
@ 2017-08-21 10:48       ` Linux Chips
  0 siblings, 0 replies; 27+ messages in thread
From: Linux Chips @ 2017-08-21 10:48 UTC (permalink / raw)
  To: Василий Ангапов
  Cc: Gregory Farnum, ceph-devel

Hi Vasily,
well, we are trying to get help from Red Hat, but that will take some time.
Any idea what they changed in that patch? What caused you to stop the
cluster in the first place? Any info might be helpful here.

thanks

On 08/18/2017 05:02 PM, Василий Ангапов wrote:
> Hi,
>
> We had exactly the same problem like you, we stopped the whole cluster 
> and then we were unable to start because of OSDs getting OOMed. We 
> have 10 nodes with 29 OSDs each, 1.5 PB of raw space with erasure 
> coding 6+3. We had 10.2.3 community Ceph version.
> We had nodes with 192 GB RAM each, we then increased RAM to 1TB and 
> slowly were starting OSDs one by one, but still at some point 
> everything went down very quickly.
> We requested paid help from Red Hat and after some time they produced 
> a special patch for us with version 10.2.3-374-gc3d3a11 
> (c3d3a11c068ee2fbab73208c3d5e01ba2f86afc4). After that memory 
> consumption went back to normal and we were able to start cluster.
> Not sure this is exactly your problem but the symptoms are very much 
> the same. I can elaborate more on that if you like.
>
> Regards, Vasily.
>
>
> 2017-08-18 0:21 GMT+05:30 Linux Chips <linux.chips@gmail.com 
> <mailto:linux.chips@gmail.com>>:
>
>
>
>     On 08/17/2017 08:53 PM, Gregory Farnum wrote:
>
>         On Thu, Aug 17, 2017 at 7:13 AM, Linux Chips
>         <linux.chips@gmail.com <mailto:linux.chips@gmail.com>> wrote:
>
>             Hello everybody,
>             I have Kraken cluster with 660 OSD, currently it is down
>             due to not
>             being able to complete peering, OSDs start consuming lots
>             of memory
>             draining the system and killing the node, so I set a limit
>             on the OSD
>             service (on some OSDs 28G and others as high as 35G), so
>             they get
>             killed before taking down the whole node.
>             Now I still can't peer, one OSD entering the cluster (with
>             about 300
>             already up) makes memory usage of most other OSDs so high
>             (15G+, some as
>             much as 30G) and
>             sometimes kills them when they reach the service limit.
>             which cause a spiral
>             load and causing all the OSDs to consume all the available.
>
>             I found this thread with similar symptoms:
>
>             http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html
>             <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html>
>
>             with a request for stack trace, I have a 14G core dump, we
>             generated it by
>             running the osd from the terminal, enabling the core
>             dumps, and setting
>             ulimits to 15G. what kind of a trace would be useful? all
>             thread?! any
>             better way to debug this?
>
>             What can I do do make it work, is this memory allocation
>             normal?
>
>             some info about the cluster:
>             41 hdd nodes with 12 x 4TB osd each, 5 of the nodes have
>             8TB disks. 324 GB
>             RAM and dula socket intel xeon.
>             7 nodes with 400GB x 24 ssd and 256GB RAM, and dual socket
>             cpu.
>             3 monitors
>
>             all dual 10GB ethernet, except for the monitor with dual
>             1GB ethers.
>
>             all nodes running centos 7.2
>             it is an old cluster that was upgraded continuously for
>             the past 3 years.
>             the cluster was on jewel when the issue happened due to
>             some accidental OSD
>             map changes, causing a heavy recovery operations on the
>             cluster. then we
>             upgraded to kraken in the hope of less memory foot prints.
>
>             any advice on how to proceed?
>
>         It's not normal but if something really bad happened to your
>         cluster,
>         it's been known to occur. You should go through the
>         troubleshooting
>         guides at docs.ceph.com <http://docs.ceph.com>, but the
>         general strategy is to set
>         nodown/noout/etc flags, undo whatever horrible thing you tried
>         to make
>         the map do, and then turn all the OSDs back on.
>         -Greg
>
>
>     Hi,
>     we have been trying this for the past week, it keeps consuming the
>     RAM.
>     we got the map back to the original places. marked all the flags,
>     started all the OSDs. then "ceph osd unset noup", wait 5 min, and
>     all OSDs are killed by the oom.
>     we tried one node at a time, let it finish recovering, and start
>     the next. we got to a point when we started the next node, everything
>     got killed.
>     we tried one OSD at a time, same result. one OSD up, ~40 killed by
>     oom, then it is a snowball from here until all of the active OSDs
>     get killed.
>
>     I think all this up/down that we generated has increased the
>     recovery too much. btw, we stopped all clients. and also we have
>     some not so friendly erasure pools. some OSDs now report loading
>     as much as 800 pg, while we originally had about 300-400 (I know
>     too much, but we were trying to fix it and.... well we could not).
>
>     we did a memory profiling on one of the OSDs.
>     here is the results
>
>
>      12878.6  47.6%  47.6%  12878.6  47.6% std::_Rb_tree::_M_create_node
>      12867.6  47.6%  95.2%  25746.2  95.2% std::_Rb_tree::_M_copy
>        532.4   2.0%  97.2%    686.3   2.5% OSD::heartbeat
>        122.8   0.5%  97.7%    122.8   0.5%
>     std::_Rb_tree::_M_emplace_hint_unique
>        121.9   0.5%  98.1%    171.1   0.6% AsyncConnection::send_message
>        104.2   0.4%  98.5%    104.2   0.4%
>     ceph::buffer::list::append@c4a770
>         99.7   0.4%  98.9%     99.7   0.4% std::vector::_M_default_append
>         99.6   0.4%  99.2%     99.6   0.4%
>     ceph::logging::Log::create_entry
>         72.6   0.3%  99.5%     72.6   0.3% ceph::buffer::create_aligned
>         52.4   0.2%  99.7%     52.5   0.2%
>     std::vector::_M_emplace_back_aux
>         23.9   0.1%  99.8%     57.8   0.2% OSD::do_notifies
>         17.0   0.1%  99.8%     23.1   0.1%
>     OSDService::build_incremental_map_msg
>          9.8   0.0%  99.9%    222.5   0.8% std::enable_if::type decode
>          6.2   0.0%  99.9%      6.3   0.0% std::map::operator[]
>          5.5   0.0%  99.9%      5.5   0.0% std::vector::vector
>          3.5   0.0%  99.9%      3.5   0.0% EventCenter::create_time_event
>          2.5   0.0%  99.9%      2.5   0.0%
>     AsyncConnection::AsyncConnection
>          2.4   0.0% 100.0%      2.4   0.0% std::string::_Rep::_S_create
>          1.5   0.0% 100.0%      1.5   0.0% std::_Rb_tree::_M_insert_unique
>          1.4   0.0% 100.0%      1.4   0.0% std::list::operator=
>          1.3   0.0% 100.0%      1.3   0.0% ceph::buffer::list::list
>          0.9   0.0% 100.0%    204.1   0.8% decode_message
>          0.8   0.0% 100.0%      0.8   0.0% OSD::send_failures
>          0.7   0.0% 100.0%      0.9   0.0% void decode
>          0.6   0.0% 100.0%      0.6   0.0% std::_Rb_tree::_M_insert_equal
>          0.6   0.0% 100.0%      2.5   0.0% PG::queue_null
>          0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::create_connect
>          0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::add_accept
>          0.5   0.0% 100.0%      0.5   0.0% boost::statechart::event::clone
>          0.4   0.0% 100.0%      0.4   0.0% PG::queue_peering_event
>          0.3   0.0% 100.0%      0.3   0.0% OSD::PeeringWQ::_enqueue
>          0.3   0.0% 100.0%    148.6   0.5% OSD::_dispatch
>          0.1   0.0% 100.0%    147.9   0.5% OSD::handle_osd_map
>          0.1   0.0% 100.0%      0.1   0.0% std::deque::_M_push_back_aux
>          0.1   0.0% 100.0%      0.2   0.0% SharedLRU::add
>          0.1   0.0% 100.0%      0.1   0.0% OSD::PeeringWQ::_dequeue
>          0.1   0.0% 100.0%      0.1   0.0%
>     ceph::buffer::list::append@c4a9b0
>          0.1   0.0% 100.0%      0.2   0.0% DispatchQueue::enqueue
>          0.1   0.0% 100.0%    283.5   1.0% EventCenter::process_events
>          0.1   0.0% 100.0%      0.1   0.0% HitSet::Params::create_impl
>          0.1   0.0% 100.0%      0.1   0.0% SimpleLRU::clear_pinned
>          0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_
>          0.0   0.0% 100.0%      0.2   0.0% TrackedOp::mark_event
>          0.0   0.0% 100.0%      0.0   0.0% OSD::create_context
>          0.0   0.0% 100.0%      0.0   0.0%
>     std::_Hashtable::_M_allocate_node
>          0.0   0.0% 100.0%      0.0   0.0% OSDMap::OSDMap
>          0.0   0.0% 100.0%    281.6   1.0% AsyncConnection::process
>          0.0   0.0% 100.0%  25802.4  95.4%
>     PG::RecoveryState::RecoveryMachine::send_notify
>          0.0   0.0% 100.0%      0.0   0.0% SharedLRU::lru_add
>          0.0   0.0% 100.0%      0.0   0.0%
>     std::_Rb_tree::_M_insert_unique_
>          0.0   0.0% 100.0%      0.1   0.0%
>     OpTracker::unregister_inflight_op
>          0.0   0.0% 100.0%      0.0   0.0% OSD::ms_verify_authorizer
>          0.0   0.0% 100.0%      0.0   0.0% OSDService::_add_map
>          0.0   0.0% 100.0%      0.1   0.0% OSD::wait_for_new_map
>          0.0   0.0% 100.0%      0.5   0.0% OSD::handle_pg_notify
>          0.0   0.0% 100.0%      0.0   0.0%
>     std::__shared_count::__shared_count
>          0.0   0.0% 100.0%      0.0   0.0% std::__shared_ptr::reset
>          0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b84080
>          0.0   0.0% 100.0%      0.0   0.0%
>     std::_Rb_tree::_M_emplace_unique
>          0.0   0.0% 100.0%      0.0   0.0% std::vector::operator=
>          0.0   0.0% 100.0%      0.0   0.0% MonClient::_renew_subs
>          0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_emplace
>          0.0   0.0% 100.0%      0.0   0.0% PORT_Alloc_Util
>          0.0   0.0% 100.0%      0.0   0.0% CryptoAES::get_key_handler
>          0.0   0.0% 100.0%      0.0   0.0% get_auth_session_handler
>          0.0   0.0% 100.0%      0.0   0.0% PosixWorker::connect
>          0.0   0.0% 100.0%      0.0   0.0%
>     ceph::buffer::list::append@c4a440
>          0.0   0.0% 100.0%      0.0   0.0% std::vector::_M_fill_insert
>          0.0   0.0% 100.0%      4.8   0.0% AsyncConnection::fault
>          0.0   0.0% 100.0%      0.0   0.0% OSD::send_pg_stats
>          0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::accept_conn
>          0.0   0.0% 100.0%      0.0   0.0% PosixServerSocketImpl::accept
>          0.0   0.0% 100.0%      9.3   0.0%
>     AsyncConnection::_process_connection
>          0.0   0.0% 100.0%      0.2   0.0% FileStore::lfn_open
>          0.0   0.0% 100.0%      0.0   0.0%
>     ceph::buffer::list::append@c4a350
>          0.0   0.0% 100.0%      0.0   0.0% crush_create
>          0.0   0.0% 100.0%      0.1   0.0% MgrClient::send_report
>          0.0   0.0% 100.0%      0.0   0.0% WBThrottle::queue_wb
>          0.0   0.0% 100.0%      0.2   0.0% LogClient::_get_mon_log_message
>          0.0   0.0% 100.0%      0.0   0.0% CryptoKey::_set_secret
>          0.0   0.0% 100.0%      0.0   0.0%
>     std::_Deque_base::_M_initialize_map
>          0.0   0.0% 100.0%      0.1   0.0%
>     ThreadPool::BatchWorkQueue::_void_dequeue
>          0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@ba6a50
>          0.0   0.0% 100.0%      0.0   0.0% MonClient::schedule_tick
>          0.0   0.0% 100.0%      0.1   0.0% OSD::tick
>          0.0   0.0% 100.0%     37.6   0.1% OSD::tick_without_osd_lock
>          0.0   0.0% 100.0%      0.0   0.0%
>     boost::spirit::classic::impl::get_definition
>          0.0   0.0% 100.0%      9.4   0.0% MonClient::_send_mon_message
>          0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_refused
>          0.0   0.0% 100.0%      0.0   0.0% OSD::handle_command
>          0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_accept
>          0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_connect
>          0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_stop
>          0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::accept
>          0.0   0.0% 100.0%      0.0   0.0%
>     AsyncConnection::handle_connect_msg
>          0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::mark_down
>          0.0   0.0% 100.0%      0.0   0.0%
>     AsyncConnection::prepare_send_message
>          0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_bulk
>          0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_until
>          0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::send_keepalive
>          0.0   0.0% 100.0%      3.3   0.0% AsyncConnection::wakeup_from
>          0.0   0.0% 100.0%      1.8   0.0% AsyncMessenger::get_connection
>          0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::reap_dead
>          0.0   0.0% 100.0%      2.5   0.0% C_OnMapCommit::finish
>          0.0   0.0% 100.0%      0.0   0.0%
>     CephXTicketHandler::verify_service_ticket_reply
>          0.0   0.0% 100.0%      0.0   0.0%
>     CephXTicketManager::verify_service_ticket_reply
>          0.0   0.0% 100.0%      0.0   0.0%
>     CephxAuthorizeHandler::verify_authorizer
>          0.0   0.0% 100.0%      0.0   0.0%
>     CephxClientHandler::handle_response
>          0.0   0.0% 100.0%     40.9   0.2% Context::complete
>          0.0   0.0% 100.0%      4.8   0.0% CrushWrapper::encode
>          0.0   0.0% 100.0%      0.0   0.0% CryptoAESKeyHandler::decrypt
>          0.0   0.0% 100.0%      0.0   0.0% CryptoKey::decode
>          0.0   0.0% 100.0%    160.4   0.6%
>     DispatchQueue::DispatchThread::entry
>          0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::entry
>          0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::fast_dispatch
>          0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::pre_dispatch
>          0.0   0.0% 100.0%      0.0   0.0% EntityName::set
>          0.0   0.0% 100.0%      0.0   0.0% EpollDriver::event_wait
>          0.0   0.0% 100.0%      3.0   0.0%
>     EventCenter::dispatch_event_external
>          0.0   0.0% 100.0%      3.3   0.0%
>     EventCenter::process_time_events
>          0.0   0.0% 100.0%      3.0   0.0% EventCenter::wakeup
>          0.0   0.0% 100.0%      0.0   0.0% FileJournal::prepare_entry
>          0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_op
>          0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transaction
>          0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transactions
>          0.0   0.0% 100.0%      0.0   0.0% FileStore::_journaled_ahead
>          0.0   0.0% 100.0%      0.2   0.0% FileStore::_write
>          0.0   0.0% 100.0%      0.0   0.0% FileStore::queue_transactions
>          0.0   0.0% 100.0%      2.6   0.0% Finisher::finisher_thread_entry
>          0.0   0.0% 100.0%      0.1   0.0% FunctionContext::finish
>          0.0   0.0% 100.0%      0.1   0.0% HitSet::Params::decode
>          0.0   0.0% 100.0%      0.2   0.0% LogChannel::do_log@a90a00
>          0.0   0.0% 100.0%      0.3   0.0% LogChannel::do_log@a91030
>          0.0   0.0% 100.0%      0.2   0.0% LogClient::get_mon_log_message
>          0.0   0.0% 100.0%      0.0   0.0% LogClient::handle_log_ack
>          0.0   0.0% 100.0%      0.1   0.0% LogClient::queue
>          0.0   0.0% 100.0%      0.3   0.0% LogClientTemp::~LogClientTemp
>          0.0   0.0% 100.0%      0.0   0.0% MAuthReply::decode_payload
>          0.0   0.0% 100.0%      0.0   0.0% MCommand::decode_payload
>          0.0   0.0% 100.0%      0.0   0.0% MCommand::print
>          0.0   0.0% 100.0%      0.0   0.0% MMgrMap::decode_payload
>          0.0   0.0% 100.0%      0.0   0.0% MOSDFailure::print
>          0.0   0.0% 100.0%      0.1   0.0% MOSDMap::decode_payload
>          0.0   0.0% 100.0%    203.1   0.8% MOSDPGNotify::decode_payload
>          0.0   0.0% 100.0%      0.0   0.0% MOSDPGNotify::print
>          0.0   0.0% 100.0%      0.0   0.0% MOSDPing::encode_payload
>          0.0   0.0% 100.0%      0.0   0.0% Message::encode
>          0.0   0.0% 100.0%      0.0   0.0% MgrClient::handle_mgr_map
>          0.0   0.0% 100.0%      0.0   0.0% MgrClient::ms_dispatch
>          0.0   0.0% 100.0%      0.0   0.0% MgrMap::decode
>          0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_rotating
>          0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_tickets
>          0.0   0.0% 100.0%      0.0   0.0% MonClient::_finish_hunting
>          0.0   0.0% 100.0%      0.8   0.0%
>     MonClient::_reopen_session@aeab80
>          0.0   0.0% 100.0%      0.6   0.0%
>     MonClient::_reopen_session@af2ba0
>          0.0   0.0% 100.0%      9.5   0.0% MonClient::handle_auth
>          0.0   0.0% 100.0%      9.6   0.0% MonClient::ms_dispatch
>          0.0   0.0% 100.0%      0.2   0.0% MonClient::send_log
>          0.0   0.0% 100.0%      0.6   0.0% MonClient::tick
>          0.0   0.0% 100.0%    283.5   1.0% NetworkStack::get_worker
>          0.0   0.0% 100.0%      0.0   0.0% OSD::CommandWQ::_process
>          0.0   0.0% 100.0%  25862.5  95.7% OSD::PeeringWQ::_process
>          0.0   0.0% 100.0%      0.0   0.0% OSD::Session::Session
>          0.0   0.0% 100.0%    686.3   2.5% OSD::T_Heartbeat::entry
>          0.0   0.0% 100.0%      2.5   0.0% OSD::_committed_osd_maps
>          0.0   0.0% 100.0%  25804.6  95.5% OSD::advance_pg
>          0.0   0.0% 100.0%      0.3   0.0% OSD::check_ops_in_flight
>          0.0   0.0% 100.0%      0.0   0.0% OSD::check_osdmap_features
>          0.0   0.0% 100.0%      2.5   0.0% OSD::consume_map
>          0.0   0.0% 100.0%     57.8   0.2% OSD::dispatch_context
>          0.0   0.0% 100.0%      0.5   0.0% OSD::dispatch_op
>          0.0   0.0% 100.0%      0.0   0.0% OSD::do_command
>          0.0   0.0% 100.0%      0.2   0.0% OSD::do_waiters
>          0.0   0.0% 100.0%      0.0   0.0% OSD::get_osdmap_pobject_name
>          0.0   0.0% 100.0%      0.1   0.0% OSD::handle_osd_ping
>          0.0   0.0% 100.0%      0.0   0.0% OSD::handle_pg_peering_evt
>          0.0   0.0% 100.0%     37.2   0.1% OSD::heartbeat_check
>          0.0   0.0% 100.0%      0.1   0.0% OSD::heartbeat_dispatch
>          0.0   0.0% 100.0%    686.3   2.5% OSD::heartbeat_entry
>          0.0   0.0% 100.0%      1.1   0.0% OSD::heartbeat_reset
>          0.0   0.0% 100.0%    148.7   0.6% OSD::ms_dispatch
>          0.0   0.0% 100.0%      0.8   0.0% OSD::ms_handle_connect
>          0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_refused
>          0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_reset
>          0.0   0.0% 100.0%  25862.5  95.7% OSD::process_peering_events
>          0.0   0.0% 100.0%      0.1   0.0% OSD::require_same_or_newer_map
>          0.0   0.0% 100.0%      0.0   0.0% OSD::write_superblock
>          0.0   0.0% 100.0%      0.0   0.0% OSDCap::parse
>          0.0   0.0% 100.0%      0.0   0.0% OSDMap::Incremental::decode
>          0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b85440
>          0.0   0.0% 100.0%    110.8   0.4% OSDMap::encode
>          0.0   0.0% 100.0%      0.5   0.0% OSDMap::post_decode
>          0.0   0.0% 100.0%      0.1   0.0% OSDService::_get_map_bl
>          0.0   0.0% 100.0%      0.0   0.0%
>     OSDService::check_nearfull_warning
>          0.0   0.0% 100.0%      0.1   0.0%
>     OSDService::clear_map_bl_cache_pins
>          0.0   0.0% 100.0%      1.1   0.0% OSDService::get_con_osd_hb
>          0.0   0.0% 100.0%      1.3   0.0% OSDService::get_inc_map_bl
>          0.0   0.0% 100.0%      1.3   0.0% OSDService::pin_map_bl
>          0.0   0.0% 100.0%      0.0   0.0% OSDService::pin_map_inc_bl
>          0.0   0.0% 100.0%      0.0   0.0% OSDService::publish_superblock
>          0.0   0.0% 100.0%      0.3   0.0% OSDService::queue_for_peering
>          0.0   0.0% 100.0%     27.2   0.1%
>     OSDService::send_incremental_map
>          0.0   0.0% 100.0%     27.2   0.1% OSDService::share_map_peer
>          0.0   0.0% 100.0%      0.0   0.0% OSDService::update_osd_stat
>          0.0   0.0% 100.0%      0.0   0.0%
>     ObjectStore::Transaction::_get_coll_id
>          0.0   0.0% 100.0%      0.0   0.0%
>     ObjectStore::Transaction::_get_next_op
>          0.0   0.0% 100.0%      0.2   0.0% ObjectStore::Transaction::write
>          0.0   0.0% 100.0%      0.0   0.0% ObjectStore::queue_transaction
>          0.0   0.0% 100.0%      0.0   0.0% Objecter::_maybe_request_map
>          0.0   0.0% 100.0%      0.1   0.0% Objecter::handle_osd_map
>          0.0   0.0% 100.0%      0.1   0.0% OpHistory::insert
>          0.0   0.0% 100.0%      0.0   0.0% OpRequest::OpRequest
>          0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_flag_point
>          0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_started
>          0.0   0.0% 100.0%      0.1   0.0%
>     OpTracker::RemoveOnDelete::operator
>          0.0   0.0% 100.0%      0.1   0.0% OpTracker::_mark_event
>          0.0   0.0% 100.0%      0.0   0.0% OpTracker::get_age_ms_histogram
>          0.0   0.0% 100.0%  25802.4  95.4% PG::RecoveryState::Stray::react
>          0.0   0.0% 100.0%      0.0   0.0% PG::_prepare_write_info
>          0.0   0.0% 100.0%  25802.4  95.4% PG::handle_activate_map
>          0.0   0.0% 100.0%      1.6   0.0% PG::handle_advance_map
>          0.0   0.0% 100.0%      0.0   0.0% PG::prepare_write_info
>          0.0   0.0% 100.0%      0.0   0.0% PG::write_if_dirty
>          0.0   0.0% 100.0%      1.6   0.0% PGPool::update
>          0.0   0.0% 100.0%      0.0   0.0% PK11_FreeSymKey
>          0.0   0.0% 100.0%      0.0   0.0% PK11_GetIVLength
>          0.0   0.0% 100.0%      0.0   0.0% PK11_ImportSymKey
>          0.0   0.0% 100.0%      0.0   0.0% PrebufferedStreambuf::overflow
>          0.0   0.0% 100.0%      1.8   0.0% Processor::accept
>          0.0   0.0% 100.0%      0.0   0.0% SECITEM_CopyItem_Util
>          0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_after
>          0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_at
>          0.0   0.0% 100.0%     38.4   0.1% SafeTimer::timer_thread
>          0.0   0.0% 100.0%     38.4   0.1% SafeTimerThread::entry
>          0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::WorkThread::entry
>          0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::worker
>          0.0   0.0% 100.0%  27023.8 100.0% __clone
>          0.0   0.0% 100.0%      0.0   0.0%
>     boost::detail::function::void_function_obj_invoker2::invoke
>          0.0   0.0% 100.0%      0.0   0.0%
>     boost::proto::detail::default_assign::impl::operator
>          0.0   0.0% 100.0%      0.0   0.0%
>     boost::spirit::classic::impl::concrete_parser::do_parse_virtual
>          0.0   0.0% 100.0%      0.0   0.0%
>     boost::spirit::qi::action::parse
>          0.0   0.0% 100.0%      0.3   0.0%
>     boost::statechart::event_base::intrusive_from_this
>          0.0   0.0% 100.0%  25802.4  95.4%
>     boost::statechart::simple_state::react_impl
>          0.0   0.0% 100.0%  25802.4  95.4%
>     boost::statechart::state_machine::send_event
>          0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@48b620
>          0.0   0.0% 100.0%      0.4   0.0%
>     ceph::buffer::list::contiguous_appender::contiguous_appender
>          0.0   0.0% 100.0%      2.4   0.0% ceph::buffer::list::crc32c
>          0.0   0.0% 100.0%      0.1   0.0%
>     ceph::buffer::list::iterator_impl::copy
>          0.0   0.0% 100.0%      0.0   0.0%
>     ceph::buffer::list::iterator_impl::copy_deep
>          0.0   0.0% 100.0%      5.7   0.0%
>     ceph::buffer::list::iterator_impl::copy_shallow
>          0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::ptr::ptr
>          0.0   0.0% 100.0%      0.0   0.0%
>     ceph_heap_profiler_handle_command
>          0.0   0.0% 100.0%      0.0   0.0% ceph_os_fremovexattr
>          0.0   0.0% 100.0%      0.0   0.0% cephx_verify_authorizer
>          0.0   0.0% 100.0%      0.0   0.0% cmdmap_from_json
>          0.0   0.0% 100.0%      2.2   0.0% crush_hash_name
>          0.0   0.0% 100.0%      0.1   0.0% decode
>          0.0   0.0% 100.0%     20.1   0.1% entity_addr_t::encode
>          0.0   0.0% 100.0%      0.0   0.0% get_str_vec
>          0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15110
>          0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15b90
>          0.0   0.0% 100.0%      0.0   0.0%
>     json_spirit::Semantic_actions::new_name
>          0.0   0.0% 100.0%      0.0   0.0%
>     json_spirit::Semantic_actions::new_str
>          0.0   0.0% 100.0%      1.1   0.0%
>     json_spirit::Value_impl::get_uint64
>          0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str
>          0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str_
>          0.0   0.0% 100.0%      0.0   0.0% json_spirit::read
>          0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range
>          0.0   0.0% 100.0%      0.0   0.0%
>     json_spirit::read_range_or_throw
>          0.0   0.0% 100.0%      0.0   0.0%
>     json_spirit::substitute_esc_chars
>          0.0   0.0% 100.0%      0.0   0.0% operator<<@a91e90
>          0.0   0.0% 100.0%      3.5   0.0% osd_info_t::encode
>          0.0   0.0% 100.0%      4.4   0.0% osd_xinfo_t::encode
>          0.0   0.0% 100.0%      0.1   0.0% pg_info_t::decode
>          0.0   0.0% 100.0%      0.0   0.0% pg_info_t::operator=
>          0.0   0.0% 100.0%      9.9   0.0% pg_info_t::pg_info_t
>          0.0   0.0% 100.0%     87.5   0.3% pg_interval_t::decode
>          0.0   0.0% 100.0%      1.0   0.0% pg_pool_t::decode
>          0.0   0.0% 100.0%      1.8   0.0% pg_pool_t::encode
>          0.0   0.0% 100.0%      0.0   0.0% pg_stat_t::decode
>          0.0   0.0% 100.0%  27032.1 100.0% start_thread
>          0.0   0.0% 100.0%      1.3   0.0% std::_Rb_tree::operator=
>          0.0   0.0% 100.0%      0.1   0.0%
>     std::_Sp_counted_base::_M_release
>          0.0   0.0% 100.0%      0.0   0.0%
>     std::__detail::_Map_base::operator[]
>          0.0   0.0% 100.0%      0.0   0.0% std::__ostream_insert
>          0.0   0.0% 100.0%      0.1   0.0% std::basic_streambuf::xsputn
>          0.0   0.0% 100.0%      0.1   0.0% std::basic_string::basic_string
>          0.0   0.0% 100.0%      0.0   0.0% std::basic_stringbuf::overflow
>          0.0   0.0% 100.0%      1.0   0.0% std::basic_stringbuf::str
>          0.0   0.0% 100.0%     71.0   0.3% std::enable_if::type encode
>          0.0   0.0% 100.0%      0.1   0.0% std::getline
>          0.0   0.0% 100.0%      0.0   0.0% std::num_put::_M_insert_int
>          0.0   0.0% 100.0%      0.0   0.0% std::num_put::do_put
>          0.0   0.0% 100.0%      0.0   0.0% std::operator<<
>          0.0   0.0% 100.0%      0.0   0.0% std::ostream::_M_insert
>          0.0   0.0% 100.0%      1.2   0.0% std::string::_Rep::_M_clone
>          0.0   0.0% 100.0%      1.2   0.0% std::string::_S_construct
>          0.0   0.0% 100.0%      1.2   0.0% std::string::append
>          0.0   0.0% 100.0%      1.2   0.0% std::string::reserve
>          0.0   0.0% 100.0%    283.5   1.0% std::this_thread::__sleep_for
>          0.0   0.0% 100.0%      0.0   0.0% void
>     decode_decrypt_enc_bl@c12db0
>          0.0   0.0% 100.0%      0.0   0.0% void
>     decode_decrypt_enc_bl@c14a80
>          0.0   0.0% 100.0%      0.0   0.0% void
>     decode_decrypt_enc_bl@c15450
>          0.0   0.0% 100.0%     20.1   0.1% void encode
>
>
>     I also generated the PDf with all the charts, but not sure how to
>     share it with you guys.
>     any Idea what is happening here ?
>
>     thanks
>     ali
>
>
>
>     --
>     To unsubscribe from this list: send the line "unsubscribe
>     ceph-devel" in
>     the body of a message to majordomo@vger.kernel.org
>     <mailto:majordomo@vger.kernel.org>
>     More majordomo info at http://vger.kernel.org/majordomo-info.html
>     <http://vger.kernel.org/majordomo-info.html>
>
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-17 18:51   ` Linux Chips
  2017-08-19 16:38     ` Mustafa Muhammad
       [not found]     ` <CAGtbiz1eHTiaO4pWu4sU97E8N+=DthTXjbY_Ga9CONW862y2XQ@mail.gmail.com>
@ 2017-08-21 10:57     ` Linux Chips
  2017-08-21 13:07       ` Haomai Wang
  2 siblings, 1 reply; 27+ messages in thread
From: Linux Chips @ 2017-08-21 10:57 UTC (permalink / raw)
  To: ceph-devel

Hi,
I have an idea: move the pools' PG directories out of the OSD's "current"
directory (e.g. into a "current.bak" directory), keeping only one pool at a
time in there so the OSD would load fewer PGs.
Has anyone tried this before? Would we risk losing data?
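
To make the idea concrete: on a FileStore OSD every PG lives in a directory
named "<pool>.<pg>_head" under /var/lib/ceph/osd/ceph-<id>/current, so the
pool id is the part before the dot. As a read-only sketch (osd.12 below is
just an example id, and nothing here moves or changes any data), this shows
how the PGs on one OSD are spread across pools:

    # count PG head directories per pool on osd.12 (example id); read-only
    ls /var/lib/ceph/osd/ceph-12/current | grep '_head$' \
        | cut -d. -f1 | sort -n | uniq -c

Whether moving those directories out and back in is actually safe is exactly
what I am asking about above.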


On 08/17/2017 09:51 PM, Linux Chips wrote:
>
>
> On 08/17/2017 08:53 PM, Gregory Farnum wrote:
>> On Thu, Aug 17, 2017 at 7:13 AM, Linux Chips <linux.chips@gmail.com> 
>> wrote:
>>> Hello everybody,
>>> I have Kraken cluster with 660 OSD, currently it is down due to not
>>> being able to complete peering, OSDs start consuming lots of memory
>>> draining the system and killing the node, so I set a limit on the OSD
>>> service (on some OSDs 28G and others as high as 35G), so they get
>>> killed before taking down the whole node.
>>> Now I still can't peer, one OSD entering the cluster (with about 300
>>> already up) makes memory usage of most other OSDs so high (15G+, 
>>> some as
>>> much as 30G) and
>>> sometimes kills them when they reach the service limit. which cause 
>>> a spiral
>>> load and causing all the OSDs to consume all the available.
>>>
>>> I found this thread with similar symptoms:
>>>
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html 
>>>
>>>
>>> with a request for stack trace, I have a 14G core dump, we generated 
>>> it by
>>> running the osd from the terminal, enabling the core dumps, and setting
>>> ulimits to 15G. what kind of a trace would be useful? all thread?! any
>>> better way to debug this?
>>>
>>> What can I do do make it work, is this memory allocation normal?
>>>
>>> some info about the cluster:
>>> 41 hdd nodes with 12 x 4TB osd each, 5 of the nodes have 8TB disks. 
>>> 324 GB
>>> RAM and dula socket intel xeon.
>>> 7 nodes with 400GB x 24 ssd and 256GB RAM, and dual socket cpu.
>>> 3 monitors
>>>
>>> all dual 10GB ethernet, except for the monitor with dual 1GB ethers.
>>>
>>> all nodes running centos 7.2
>>> it is an old cluster that was upgraded continuously for the past 3 
>>> years.
>>> the cluster was on jewel when the issue happened due to some 
>>> accidental OSD
>>> map changes, causing a heavy recovery operations on the cluster. 
>>> then we
>>> upgraded to kraken in the hope of less memory foot prints.
>>>
>>> any advice on how to proceed?
>> It's not normal but if something really bad happened to your cluster,
>> it's been known to occur. You should go through the troubleshooting
>> guides at docs.ceph.com, but the general strategy is to set
>> nodown/noout/etc flags, undo whatever horrible thing you tried to make
>> the map do, and then turn all the OSDs back on.
>> -Greg
>
> Hi,
> we have been trying this for the past week, it keeps consuming the RAM.
> we got the map back to the original places. marked all the flags, 
> started all the OSDs. then "ceph osd unset noup", wait 5 min, and all 
> OSDs are killed by the oom.
> we tried one node at a time, let it finish recovering, and start the 
> next. we got to a point when we started the next node, everything got 
> killed.
> we tried one OSD at a time, same result. one OSD up, ~40 killed by 
> oom, then it is a snowball from here until all of the active OSDs get 
> killed.
>
> I think all this up/down that we generated has increased the recovery 
> too much. btw, we stopped all clients. and also we have some not so 
> friendly erasure pools. some OSDs now report loading as much as 800 
> pg, while we originally had about 300-400 (I know too much, but we 
> were trying to fix it and.... well we could not).
>
> we did a memory profiling on one of the OSDs.
> here is the results
>
>
>  12878.6  47.6%  47.6%  12878.6  47.6% std::_Rb_tree::_M_create_node
>  12867.6  47.6%  95.2%  25746.2  95.2% std::_Rb_tree::_M_copy
>    532.4   2.0%  97.2%    686.3   2.5% OSD::heartbeat
>    122.8   0.5%  97.7%    122.8   0.5% 
> std::_Rb_tree::_M_emplace_hint_unique
>    121.9   0.5%  98.1%    171.1   0.6% AsyncConnection::send_message
>    104.2   0.4%  98.5%    104.2   0.4% ceph::buffer::list::append@c4a770
>     99.7   0.4%  98.9%     99.7   0.4% std::vector::_M_default_append
>     99.6   0.4%  99.2%     99.6   0.4% ceph::logging::Log::create_entry
>     72.6   0.3%  99.5%     72.6   0.3% ceph::buffer::create_aligned
>     52.4   0.2%  99.7%     52.5   0.2% std::vector::_M_emplace_back_aux
>     23.9   0.1%  99.8%     57.8   0.2% OSD::do_notifies
>     17.0   0.1%  99.8%     23.1   0.1% 
> OSDService::build_incremental_map_msg
>      9.8   0.0%  99.9%    222.5   0.8% std::enable_if::type decode
>      6.2   0.0%  99.9%      6.3   0.0% std::map::operator[]
>      5.5   0.0%  99.9%      5.5   0.0% std::vector::vector
>      3.5   0.0%  99.9%      3.5   0.0% EventCenter::create_time_event
>      2.5   0.0%  99.9%      2.5   0.0% AsyncConnection::AsyncConnection
>      2.4   0.0% 100.0%      2.4   0.0% std::string::_Rep::_S_create
>      1.5   0.0% 100.0%      1.5   0.0% std::_Rb_tree::_M_insert_unique
>      1.4   0.0% 100.0%      1.4   0.0% std::list::operator=
>      1.3   0.0% 100.0%      1.3   0.0% ceph::buffer::list::list
>      0.9   0.0% 100.0%    204.1   0.8% decode_message
>      0.8   0.0% 100.0%      0.8   0.0% OSD::send_failures
>      0.7   0.0% 100.0%      0.9   0.0% void decode
>      0.6   0.0% 100.0%      0.6   0.0% std::_Rb_tree::_M_insert_equal
>      0.6   0.0% 100.0%      2.5   0.0% PG::queue_null
>      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::create_connect
>      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::add_accept
>      0.5   0.0% 100.0%      0.5   0.0% boost::statechart::event::clone
>      0.4   0.0% 100.0%      0.4   0.0% PG::queue_peering_event
>      0.3   0.0% 100.0%      0.3   0.0% OSD::PeeringWQ::_enqueue
>      0.3   0.0% 100.0%    148.6   0.5% OSD::_dispatch
>      0.1   0.0% 100.0%    147.9   0.5% OSD::handle_osd_map
>      0.1   0.0% 100.0%      0.1   0.0% std::deque::_M_push_back_aux
>      0.1   0.0% 100.0%      0.2   0.0% SharedLRU::add
>      0.1   0.0% 100.0%      0.1   0.0% OSD::PeeringWQ::_dequeue
>      0.1   0.0% 100.0%      0.1   0.0% ceph::buffer::list::append@c4a9b0
>      0.1   0.0% 100.0%      0.2   0.0% DispatchQueue::enqueue
>      0.1   0.0% 100.0%    283.5   1.0% EventCenter::process_events
>      0.1   0.0% 100.0%      0.1   0.0% HitSet::Params::create_impl
>      0.1   0.0% 100.0%      0.1   0.0% SimpleLRU::clear_pinned
>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_
>      0.0   0.0% 100.0%      0.2   0.0% TrackedOp::mark_event
>      0.0   0.0% 100.0%      0.0   0.0% OSD::create_context
>      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_allocate_node
>      0.0   0.0% 100.0%      0.0   0.0% OSDMap::OSDMap
>      0.0   0.0% 100.0%    281.6   1.0% AsyncConnection::process
>      0.0   0.0% 100.0%  25802.4  95.4% 
> PG::RecoveryState::RecoveryMachine::send_notify
>      0.0   0.0% 100.0%      0.0   0.0% SharedLRU::lru_add
>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_unique_
>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::unregister_inflight_op
>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_verify_authorizer
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::_add_map
>      0.0   0.0% 100.0%      0.1   0.0% OSD::wait_for_new_map
>      0.0   0.0% 100.0%      0.5   0.0% OSD::handle_pg_notify
>      0.0   0.0% 100.0%      0.0   0.0% 
> std::__shared_count::__shared_count
>      0.0   0.0% 100.0%      0.0   0.0% std::__shared_ptr::reset
>      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b84080
>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_emplace_unique
>      0.0   0.0% 100.0%      0.0   0.0% std::vector::operator=
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_renew_subs
>      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_emplace
>      0.0   0.0% 100.0%      0.0   0.0% PORT_Alloc_Util
>      0.0   0.0% 100.0%      0.0   0.0% CryptoAES::get_key_handler
>      0.0   0.0% 100.0%      0.0   0.0% get_auth_session_handler
>      0.0   0.0% 100.0%      0.0   0.0% PosixWorker::connect
>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a440
>      0.0   0.0% 100.0%      0.0   0.0% std::vector::_M_fill_insert
>      0.0   0.0% 100.0%      4.8   0.0% AsyncConnection::fault
>      0.0   0.0% 100.0%      0.0   0.0% OSD::send_pg_stats
>      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::accept_conn
>      0.0   0.0% 100.0%      0.0   0.0% PosixServerSocketImpl::accept
>      0.0   0.0% 100.0%      9.3   0.0% 
> AsyncConnection::_process_connection
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::lfn_open
>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a350
>      0.0   0.0% 100.0%      0.0   0.0% crush_create
>      0.0   0.0% 100.0%      0.1   0.0% MgrClient::send_report
>      0.0   0.0% 100.0%      0.0   0.0% WBThrottle::queue_wb
>      0.0   0.0% 100.0%      0.2   0.0% LogClient::_get_mon_log_message
>      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::_set_secret
>      0.0   0.0% 100.0%      0.0   0.0% 
> std::_Deque_base::_M_initialize_map
>      0.0   0.0% 100.0%      0.1   0.0% 
> ThreadPool::BatchWorkQueue::_void_dequeue
>      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@ba6a50
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::schedule_tick
>      0.0   0.0% 100.0%      0.1   0.0% OSD::tick
>      0.0   0.0% 100.0%     37.6   0.1% OSD::tick_without_osd_lock
>      0.0   0.0% 100.0%      0.0   0.0% 
> boost::spirit::classic::impl::get_definition
>      0.0   0.0% 100.0%      9.4   0.0% MonClient::_send_mon_message
>      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_refused
>      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_command
>      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_accept
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_connect
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_stop
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::accept
>      0.0   0.0% 100.0%      0.0   0.0% 
> AsyncConnection::handle_connect_msg
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::mark_down
>      0.0   0.0% 100.0%      0.0   0.0% 
> AsyncConnection::prepare_send_message
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_bulk
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_until
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::send_keepalive
>      0.0   0.0% 100.0%      3.3   0.0% AsyncConnection::wakeup_from
>      0.0   0.0% 100.0%      1.8   0.0% AsyncMessenger::get_connection
>      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::reap_dead
>      0.0   0.0% 100.0%      2.5   0.0% C_OnMapCommit::finish
>      0.0   0.0% 100.0%      0.0   0.0% 
> CephXTicketHandler::verify_service_ticket_reply
>      0.0   0.0% 100.0%      0.0   0.0% 
> CephXTicketManager::verify_service_ticket_reply
>      0.0   0.0% 100.0%      0.0   0.0% 
> CephxAuthorizeHandler::verify_authorizer
>      0.0   0.0% 100.0%      0.0   0.0% 
> CephxClientHandler::handle_response
>      0.0   0.0% 100.0%     40.9   0.2% Context::complete
>      0.0   0.0% 100.0%      4.8   0.0% CrushWrapper::encode
>      0.0   0.0% 100.0%      0.0   0.0% CryptoAESKeyHandler::decrypt
>      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::decode
>      0.0   0.0% 100.0%    160.4   0.6% 
> DispatchQueue::DispatchThread::entry
>      0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::entry
>      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::fast_dispatch
>      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::pre_dispatch
>      0.0   0.0% 100.0%      0.0   0.0% EntityName::set
>      0.0   0.0% 100.0%      0.0   0.0% EpollDriver::event_wait
>      0.0   0.0% 100.0%      3.0   0.0% 
> EventCenter::dispatch_event_external
>      0.0   0.0% 100.0%      3.3   0.0% EventCenter::process_time_events
>      0.0   0.0% 100.0%      3.0   0.0% EventCenter::wakeup
>      0.0   0.0% 100.0%      0.0   0.0% FileJournal::prepare_entry
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_op
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transaction
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transactions
>      0.0   0.0% 100.0%      0.0   0.0% FileStore::_journaled_ahead
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_write
>      0.0   0.0% 100.0%      0.0   0.0% FileStore::queue_transactions
>      0.0   0.0% 100.0%      2.6   0.0% Finisher::finisher_thread_entry
>      0.0   0.0% 100.0%      0.1   0.0% FunctionContext::finish
>      0.0   0.0% 100.0%      0.1   0.0% HitSet::Params::decode
>      0.0   0.0% 100.0%      0.2   0.0% LogChannel::do_log@a90a00
>      0.0   0.0% 100.0%      0.3   0.0% LogChannel::do_log@a91030
>      0.0   0.0% 100.0%      0.2   0.0% LogClient::get_mon_log_message
>      0.0   0.0% 100.0%      0.0   0.0% LogClient::handle_log_ack
>      0.0   0.0% 100.0%      0.1   0.0% LogClient::queue
>      0.0   0.0% 100.0%      0.3   0.0% LogClientTemp::~LogClientTemp
>      0.0   0.0% 100.0%      0.0   0.0% MAuthReply::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MCommand::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MCommand::print
>      0.0   0.0% 100.0%      0.0   0.0% MMgrMap::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MOSDFailure::print
>      0.0   0.0% 100.0%      0.1   0.0% MOSDMap::decode_payload
>      0.0   0.0% 100.0%    203.1   0.8% MOSDPGNotify::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MOSDPGNotify::print
>      0.0   0.0% 100.0%      0.0   0.0% MOSDPing::encode_payload
>      0.0   0.0% 100.0%      0.0   0.0% Message::encode
>      0.0   0.0% 100.0%      0.0   0.0% MgrClient::handle_mgr_map
>      0.0   0.0% 100.0%      0.0   0.0% MgrClient::ms_dispatch
>      0.0   0.0% 100.0%      0.0   0.0% MgrMap::decode
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_rotating
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_tickets
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_finish_hunting
>      0.0   0.0% 100.0%      0.8   0.0% MonClient::_reopen_session@aeab80
>      0.0   0.0% 100.0%      0.6   0.0% MonClient::_reopen_session@af2ba0
>      0.0   0.0% 100.0%      9.5   0.0% MonClient::handle_auth
>      0.0   0.0% 100.0%      9.6   0.0% MonClient::ms_dispatch
>      0.0   0.0% 100.0%      0.2   0.0% MonClient::send_log
>      0.0   0.0% 100.0%      0.6   0.0% MonClient::tick
>      0.0   0.0% 100.0%    283.5   1.0% NetworkStack::get_worker
>      0.0   0.0% 100.0%      0.0   0.0% OSD::CommandWQ::_process
>      0.0   0.0% 100.0%  25862.5  95.7% OSD::PeeringWQ::_process
>      0.0   0.0% 100.0%      0.0   0.0% OSD::Session::Session
>      0.0   0.0% 100.0%    686.3   2.5% OSD::T_Heartbeat::entry
>      0.0   0.0% 100.0%      2.5   0.0% OSD::_committed_osd_maps
>      0.0   0.0% 100.0%  25804.6  95.5% OSD::advance_pg
>      0.0   0.0% 100.0%      0.3   0.0% OSD::check_ops_in_flight
>      0.0   0.0% 100.0%      0.0   0.0% OSD::check_osdmap_features
>      0.0   0.0% 100.0%      2.5   0.0% OSD::consume_map
>      0.0   0.0% 100.0%     57.8   0.2% OSD::dispatch_context
>      0.0   0.0% 100.0%      0.5   0.0% OSD::dispatch_op
>      0.0   0.0% 100.0%      0.0   0.0% OSD::do_command
>      0.0   0.0% 100.0%      0.2   0.0% OSD::do_waiters
>      0.0   0.0% 100.0%      0.0   0.0% OSD::get_osdmap_pobject_name
>      0.0   0.0% 100.0%      0.1   0.0% OSD::handle_osd_ping
>      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_pg_peering_evt
>      0.0   0.0% 100.0%     37.2   0.1% OSD::heartbeat_check
>      0.0   0.0% 100.0%      0.1   0.0% OSD::heartbeat_dispatch
>      0.0   0.0% 100.0%    686.3   2.5% OSD::heartbeat_entry
>      0.0   0.0% 100.0%      1.1   0.0% OSD::heartbeat_reset
>      0.0   0.0% 100.0%    148.7   0.6% OSD::ms_dispatch
>      0.0   0.0% 100.0%      0.8   0.0% OSD::ms_handle_connect
>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_refused
>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_reset
>      0.0   0.0% 100.0%  25862.5  95.7% OSD::process_peering_events
>      0.0   0.0% 100.0%      0.1   0.0% OSD::require_same_or_newer_map
>      0.0   0.0% 100.0%      0.0   0.0% OSD::write_superblock
>      0.0   0.0% 100.0%      0.0   0.0% OSDCap::parse
>      0.0   0.0% 100.0%      0.0   0.0% OSDMap::Incremental::decode
>      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b85440
>      0.0   0.0% 100.0%    110.8   0.4% OSDMap::encode
>      0.0   0.0% 100.0%      0.5   0.0% OSDMap::post_decode
>      0.0   0.0% 100.0%      0.1   0.0% OSDService::_get_map_bl
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::check_nearfull_warning
>      0.0   0.0% 100.0%      0.1   0.0% 
> OSDService::clear_map_bl_cache_pins
>      0.0   0.0% 100.0%      1.1   0.0% OSDService::get_con_osd_hb
>      0.0   0.0% 100.0%      1.3   0.0% OSDService::get_inc_map_bl
>      0.0   0.0% 100.0%      1.3   0.0% OSDService::pin_map_bl
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::pin_map_inc_bl
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::publish_superblock
>      0.0   0.0% 100.0%      0.3   0.0% OSDService::queue_for_peering
>      0.0   0.0% 100.0%     27.2   0.1% OSDService::send_incremental_map
>      0.0   0.0% 100.0%     27.2   0.1% OSDService::share_map_peer
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::update_osd_stat
>      0.0   0.0% 100.0%      0.0   0.0% 
> ObjectStore::Transaction::_get_coll_id
>      0.0   0.0% 100.0%      0.0   0.0% 
> ObjectStore::Transaction::_get_next_op
>      0.0   0.0% 100.0%      0.2   0.0% ObjectStore::Transaction::write
>      0.0   0.0% 100.0%      0.0   0.0% ObjectStore::queue_transaction
>      0.0   0.0% 100.0%      0.0   0.0% Objecter::_maybe_request_map
>      0.0   0.0% 100.0%      0.1   0.0% Objecter::handle_osd_map
>      0.0   0.0% 100.0%      0.1   0.0% OpHistory::insert
>      0.0   0.0% 100.0%      0.0   0.0% OpRequest::OpRequest
>      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_flag_point
>      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_started
>      0.0   0.0% 100.0%      0.1   0.0% 
> OpTracker::RemoveOnDelete::operator
>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::_mark_event
>      0.0   0.0% 100.0%      0.0   0.0% OpTracker::get_age_ms_histogram
>      0.0   0.0% 100.0%  25802.4  95.4% PG::RecoveryState::Stray::react
>      0.0   0.0% 100.0%      0.0   0.0% PG::_prepare_write_info
>      0.0   0.0% 100.0%  25802.4  95.4% PG::handle_activate_map
>      0.0   0.0% 100.0%      1.6   0.0% PG::handle_advance_map
>      0.0   0.0% 100.0%      0.0   0.0% PG::prepare_write_info
>      0.0   0.0% 100.0%      0.0   0.0% PG::write_if_dirty
>      0.0   0.0% 100.0%      1.6   0.0% PGPool::update
>      0.0   0.0% 100.0%      0.0   0.0% PK11_FreeSymKey
>      0.0   0.0% 100.0%      0.0   0.0% PK11_GetIVLength
>      0.0   0.0% 100.0%      0.0   0.0% PK11_ImportSymKey
>      0.0   0.0% 100.0%      0.0   0.0% PrebufferedStreambuf::overflow
>      0.0   0.0% 100.0%      1.8   0.0% Processor::accept
>      0.0   0.0% 100.0%      0.0   0.0% SECITEM_CopyItem_Util
>      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_after
>      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_at
>      0.0   0.0% 100.0%     38.4   0.1% SafeTimer::timer_thread
>      0.0   0.0% 100.0%     38.4   0.1% SafeTimerThread::entry
>      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::WorkThread::entry
>      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::worker
>      0.0   0.0% 100.0%  27023.8 100.0% __clone
>      0.0   0.0% 100.0%      0.0   0.0% 
> boost::detail::function::void_function_obj_invoker2::invoke
>      0.0   0.0% 100.0%      0.0   0.0% 
> boost::proto::detail::default_assign::impl::operator
>      0.0   0.0% 100.0%      0.0   0.0% 
> boost::spirit::classic::impl::concrete_parser::do_parse_virtual
>      0.0   0.0% 100.0%      0.0   0.0% boost::spirit::qi::action::parse
>      0.0   0.0% 100.0%      0.3   0.0% 
> boost::statechart::event_base::intrusive_from_this
>      0.0   0.0% 100.0%  25802.4  95.4% 
> boost::statechart::simple_state::react_impl
>      0.0   0.0% 100.0%  25802.4  95.4% 
> boost::statechart::state_machine::send_event
>      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@48b620
>      0.0   0.0% 100.0%      0.4   0.0% 
> ceph::buffer::list::contiguous_appender::contiguous_appender
>      0.0   0.0% 100.0%      2.4   0.0% ceph::buffer::list::crc32c
>      0.0   0.0% 100.0%      0.1   0.0% 
> ceph::buffer::list::iterator_impl::copy
>      0.0   0.0% 100.0%      0.0   0.0% 
> ceph::buffer::list::iterator_impl::copy_deep
>      0.0   0.0% 100.0%      5.7   0.0% 
> ceph::buffer::list::iterator_impl::copy_shallow
>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::ptr::ptr
>      0.0   0.0% 100.0%      0.0   0.0% ceph_heap_profiler_handle_command
>      0.0   0.0% 100.0%      0.0   0.0% ceph_os_fremovexattr
>      0.0   0.0% 100.0%      0.0   0.0% cephx_verify_authorizer
>      0.0   0.0% 100.0%      0.0   0.0% cmdmap_from_json
>      0.0   0.0% 100.0%      2.2   0.0% crush_hash_name
>      0.0   0.0% 100.0%      0.1   0.0% decode
>      0.0   0.0% 100.0%     20.1   0.1% entity_addr_t::encode
>      0.0   0.0% 100.0%      0.0   0.0% get_str_vec
>      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15110
>      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15b90
>      0.0   0.0% 100.0%      0.0   0.0% 
> json_spirit::Semantic_actions::new_name
>      0.0   0.0% 100.0%      0.0   0.0% 
> json_spirit::Semantic_actions::new_str
>      0.0   0.0% 100.0%      1.1   0.0% 
> json_spirit::Value_impl::get_uint64
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str_
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range_or_throw
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::substitute_esc_chars
>      0.0   0.0% 100.0%      0.0   0.0% operator<<@a91e90
>      0.0   0.0% 100.0%      3.5   0.0% osd_info_t::encode
>      0.0   0.0% 100.0%      4.4   0.0% osd_xinfo_t::encode
>      0.0   0.0% 100.0%      0.1   0.0% pg_info_t::decode
>      0.0   0.0% 100.0%      0.0   0.0% pg_info_t::operator=
>      0.0   0.0% 100.0%      9.9   0.0% pg_info_t::pg_info_t
>      0.0   0.0% 100.0%     87.5   0.3% pg_interval_t::decode
>      0.0   0.0% 100.0%      1.0   0.0% pg_pool_t::decode
>      0.0   0.0% 100.0%      1.8   0.0% pg_pool_t::encode
>      0.0   0.0% 100.0%      0.0   0.0% pg_stat_t::decode
>      0.0   0.0% 100.0%  27032.1 100.0% start_thread
>      0.0   0.0% 100.0%      1.3   0.0% std::_Rb_tree::operator=
>      0.0   0.0% 100.0%      0.1   0.0% std::_Sp_counted_base::_M_release
>      0.0   0.0% 100.0%      0.0   0.0% 
> std::__detail::_Map_base::operator[]
>      0.0   0.0% 100.0%      0.0   0.0% std::__ostream_insert
>      0.0   0.0% 100.0%      0.1   0.0% std::basic_streambuf::xsputn
>      0.0   0.0% 100.0%      0.1   0.0% std::basic_string::basic_string
>      0.0   0.0% 100.0%      0.0   0.0% std::basic_stringbuf::overflow
>      0.0   0.0% 100.0%      1.0   0.0% std::basic_stringbuf::str
>      0.0   0.0% 100.0%     71.0   0.3% std::enable_if::type encode
>      0.0   0.0% 100.0%      0.1   0.0% std::getline
>      0.0   0.0% 100.0%      0.0   0.0% std::num_put::_M_insert_int
>      0.0   0.0% 100.0%      0.0   0.0% std::num_put::do_put
>      0.0   0.0% 100.0%      0.0   0.0% std::operator<<
>      0.0   0.0% 100.0%      0.0   0.0% std::ostream::_M_insert
>      0.0   0.0% 100.0%      1.2   0.0% std::string::_Rep::_M_clone
>      0.0   0.0% 100.0%      1.2   0.0% std::string::_S_construct
>      0.0   0.0% 100.0%      1.2   0.0% std::string::append
>      0.0   0.0% 100.0%      1.2   0.0% std::string::reserve
>      0.0   0.0% 100.0%    283.5   1.0% std::this_thread::__sleep_for
>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c12db0
>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c14a80
>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c15450
>      0.0   0.0% 100.0%     20.1   0.1% void encode
>
>
> I also generated the PDf with all the charts, but not sure how to 
> share it with you guys.
> any Idea what is happening here ?
>
> thanks
> ali
>
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-21 10:57     ` Linux Chips
@ 2017-08-21 13:07       ` Haomai Wang
       [not found]         ` <93debf2d-12cb-eceb-e9cd-5226ad49cc16@gmail.com>
  0 siblings, 1 reply; 27+ messages in thread
From: Haomai Wang @ 2017-08-21 13:07 UTC (permalink / raw)
  To: Linux Chips; +Cc: ceph-devel

Did you try lowering osd_map_cache_size to 20?
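
Roughly like this, either persistently via ceph.conf plus an OSD restart, or
injected into whatever OSDs are currently running (20 is just a value to try,
not something I have tested on your cluster):

    # /etc/ceph/ceph.conf on the OSD nodes, then restart the OSDs
    [osd]
        osd map cache size = 20

    # or on the fly, for OSDs that are currently up:
    ceph tell osd.* injectargs '--osd_map_cache_size 20'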

On Mon, Aug 21, 2017 at 6:57 PM, Linux Chips <linux.chips@gmail.com> wrote:
> Hi,
> I have an idea to move pools out of the "current" directory (like move them
> into a directory "current.bak"), and keep only one pool at a time in there
> so the OSD would load less PGs.
> any one tried to do this before? will we have a data loss?
>
>
>
> On 08/17/2017 09:51 PM, Linux Chips wrote:
>>
>>
>>
>> On 08/17/2017 08:53 PM, Gregory Farnum wrote:
>>>
>>> On Thu, Aug 17, 2017 at 7:13 AM, Linux Chips <linux.chips@gmail.com>
>>> wrote:
>>>>
>>>> Hello everybody,
>>>> I have Kraken cluster with 660 OSD, currently it is down due to not
>>>> being able to complete peering, OSDs start consuming lots of memory
>>>> draining the system and killing the node, so I set a limit on the OSD
>>>> service (on some OSDs 28G and others as high as 35G), so they get
>>>> killed before taking down the whole node.
>>>> Now I still can't peer, one OSD entering the cluster (with about 300
>>>> already up) makes memory usage of most other OSDs so high (15G+, some as
>>>> much as 30G) and
>>>> sometimes kills them when they reach the service limit. which cause a
>>>> spiral
>>>> load and causing all the OSDs to consume all the available.
>>>>
>>>> I found this thread with similar symptoms:
>>>>
>>>>
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html
>>>>
>>>> with a request for stack trace, I have a 14G core dump, we generated it
>>>> by
>>>> running the osd from the terminal, enabling the core dumps, and setting
>>>> ulimits to 15G. what kind of a trace would be useful? all thread?! any
>>>> better way to debug this?
>>>>
>>>> What can I do do make it work, is this memory allocation normal?
>>>>
>>>> some info about the cluster:
>>>> 41 hdd nodes with 12 x 4TB osd each, 5 of the nodes have 8TB disks. 324
>>>> GB
>>>> RAM and dula socket intel xeon.
>>>> 7 nodes with 400GB x 24 ssd and 256GB RAM, and dual socket cpu.
>>>> 3 monitors
>>>>
>>>> all dual 10GB ethernet, except for the monitor with dual 1GB ethers.
>>>>
>>>> all nodes running centos 7.2
>>>> it is an old cluster that was upgraded continuously for the past 3
>>>> years.
>>>> the cluster was on jewel when the issue happened due to some accidental
>>>> OSD
>>>> map changes, causing a heavy recovery operations on the cluster. then we
>>>> upgraded to kraken in the hope of less memory foot prints.
>>>>
>>>> any advice on how to proceed?
>>>
>>> It's not normal but if something really bad happened to your cluster,
>>> it's been known to occur. You should go through the troubleshooting
>>> guides at docs.ceph.com, but the general strategy is to set
>>> nodown/noout/etc flags, undo whatever horrible thing you tried to make
>>> the map do, and then turn all the OSDs back on.
>>> -Greg
>>
>>
>> Hi,
>> we have been trying this for the past week, it keeps consuming the RAM.
>> we got the map back to the original places. marked all the flags, started
>> all the OSDs. then "ceph osd unset noup", wait 5 min, and all OSDs are
>> killed by the oom.
>> we tried one node at a time, let it finish recovering, and start the next.
>> we got to a point when we started the next node, everything got killed.
>> we tried one OSD at a time, same result. one OSD up, ~40 killed by oom,
>> then it is a snowball from here until all of the active OSDs get killed.
>>
>> I think all this up/down that we generated has increased the recovery too
>> much. btw, we stopped all clients. and also we have some not so friendly
>> erasure pools. some OSDs now report loading as much as 800 pg, while we
>> originally had about 300-400 (I know too much, but we were trying to fix it
>> and.... well we could not).
>>
>> we did a memory profiling on one of the OSDs.
>> here is the results
>>
>>
>>  12878.6  47.6%  47.6%  12878.6  47.6% std::_Rb_tree::_M_create_node
>>  12867.6  47.6%  95.2%  25746.2  95.2% std::_Rb_tree::_M_copy
>>    532.4   2.0%  97.2%    686.3   2.5% OSD::heartbeat
>>    122.8   0.5%  97.7%    122.8   0.5%
>> std::_Rb_tree::_M_emplace_hint_unique
>>    121.9   0.5%  98.1%    171.1   0.6% AsyncConnection::send_message
>>    104.2   0.4%  98.5%    104.2   0.4% ceph::buffer::list::append@c4a770
>>     99.7   0.4%  98.9%     99.7   0.4% std::vector::_M_default_append
>>     99.6   0.4%  99.2%     99.6   0.4% ceph::logging::Log::create_entry
>>     72.6   0.3%  99.5%     72.6   0.3% ceph::buffer::create_aligned
>>     52.4   0.2%  99.7%     52.5   0.2% std::vector::_M_emplace_back_aux
>>     23.9   0.1%  99.8%     57.8   0.2% OSD::do_notifies
>>     17.0   0.1%  99.8%     23.1   0.1%
>> OSDService::build_incremental_map_msg
>>      9.8   0.0%  99.9%    222.5   0.8% std::enable_if::type decode
>>      6.2   0.0%  99.9%      6.3   0.0% std::map::operator[]
>>      5.5   0.0%  99.9%      5.5   0.0% std::vector::vector
>>      3.5   0.0%  99.9%      3.5   0.0% EventCenter::create_time_event
>>      2.5   0.0%  99.9%      2.5   0.0% AsyncConnection::AsyncConnection
>>      2.4   0.0% 100.0%      2.4   0.0% std::string::_Rep::_S_create
>>      1.5   0.0% 100.0%      1.5   0.0% std::_Rb_tree::_M_insert_unique
>>      1.4   0.0% 100.0%      1.4   0.0% std::list::operator=
>>      1.3   0.0% 100.0%      1.3   0.0% ceph::buffer::list::list
>>      0.9   0.0% 100.0%    204.1   0.8% decode_message
>>      0.8   0.0% 100.0%      0.8   0.0% OSD::send_failures
>>      0.7   0.0% 100.0%      0.9   0.0% void decode
>>      0.6   0.0% 100.0%      0.6   0.0% std::_Rb_tree::_M_insert_equal
>>      0.6   0.0% 100.0%      2.5   0.0% PG::queue_null
>>      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::create_connect
>>      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::add_accept
>>      0.5   0.0% 100.0%      0.5   0.0% boost::statechart::event::clone
>>      0.4   0.0% 100.0%      0.4   0.0% PG::queue_peering_event
>>      0.3   0.0% 100.0%      0.3   0.0% OSD::PeeringWQ::_enqueue
>>      0.3   0.0% 100.0%    148.6   0.5% OSD::_dispatch
>>      0.1   0.0% 100.0%    147.9   0.5% OSD::handle_osd_map
>>      0.1   0.0% 100.0%      0.1   0.0% std::deque::_M_push_back_aux
>>      0.1   0.0% 100.0%      0.2   0.0% SharedLRU::add
>>      0.1   0.0% 100.0%      0.1   0.0% OSD::PeeringWQ::_dequeue
>>      0.1   0.0% 100.0%      0.1   0.0% ceph::buffer::list::append@c4a9b0
>>      0.1   0.0% 100.0%      0.2   0.0% DispatchQueue::enqueue
>>      0.1   0.0% 100.0%    283.5   1.0% EventCenter::process_events
>>      0.1   0.0% 100.0%      0.1   0.0% HitSet::Params::create_impl
>>      0.1   0.0% 100.0%      0.1   0.0% SimpleLRU::clear_pinned
>>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_
>>      0.0   0.0% 100.0%      0.2   0.0% TrackedOp::mark_event
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::create_context
>>      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_allocate_node
>>      0.0   0.0% 100.0%      0.0   0.0% OSDMap::OSDMap
>>      0.0   0.0% 100.0%    281.6   1.0% AsyncConnection::process
>>      0.0   0.0% 100.0%  25802.4  95.4%
>> PG::RecoveryState::RecoveryMachine::send_notify
>>      0.0   0.0% 100.0%      0.0   0.0% SharedLRU::lru_add
>>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_unique_
>>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::unregister_inflight_op
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_verify_authorizer
>>      0.0   0.0% 100.0%      0.0   0.0% OSDService::_add_map
>>      0.0   0.0% 100.0%      0.1   0.0% OSD::wait_for_new_map
>>      0.0   0.0% 100.0%      0.5   0.0% OSD::handle_pg_notify
>>      0.0   0.0% 100.0%      0.0   0.0% std::__shared_count::__shared_count
>>      0.0   0.0% 100.0%      0.0   0.0% std::__shared_ptr::reset
>>      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b84080
>>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_emplace_unique
>>      0.0   0.0% 100.0%      0.0   0.0% std::vector::operator=
>>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_renew_subs
>>      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_emplace
>>      0.0   0.0% 100.0%      0.0   0.0% PORT_Alloc_Util
>>      0.0   0.0% 100.0%      0.0   0.0% CryptoAES::get_key_handler
>>      0.0   0.0% 100.0%      0.0   0.0% get_auth_session_handler
>>      0.0   0.0% 100.0%      0.0   0.0% PosixWorker::connect
>>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a440
>>      0.0   0.0% 100.0%      0.0   0.0% std::vector::_M_fill_insert
>>      0.0   0.0% 100.0%      4.8   0.0% AsyncConnection::fault
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::send_pg_stats
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::accept_conn
>>      0.0   0.0% 100.0%      0.0   0.0% PosixServerSocketImpl::accept
>>      0.0   0.0% 100.0%      9.3   0.0%
>> AsyncConnection::_process_connection
>>      0.0   0.0% 100.0%      0.2   0.0% FileStore::lfn_open
>>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a350
>>      0.0   0.0% 100.0%      0.0   0.0% crush_create
>>      0.0   0.0% 100.0%      0.1   0.0% MgrClient::send_report
>>      0.0   0.0% 100.0%      0.0   0.0% WBThrottle::queue_wb
>>      0.0   0.0% 100.0%      0.2   0.0% LogClient::_get_mon_log_message
>>      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::_set_secret
>>      0.0   0.0% 100.0%      0.0   0.0% std::_Deque_base::_M_initialize_map
>>      0.0   0.0% 100.0%      0.1   0.0%
>> ThreadPool::BatchWorkQueue::_void_dequeue
>>      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@ba6a50
>>      0.0   0.0% 100.0%      0.0   0.0% MonClient::schedule_tick
>>      0.0   0.0% 100.0%      0.1   0.0% OSD::tick
>>      0.0   0.0% 100.0%     37.6   0.1% OSD::tick_without_osd_lock
>>      0.0   0.0% 100.0%      0.0   0.0%
>> boost::spirit::classic::impl::get_definition
>>      0.0   0.0% 100.0%      9.4   0.0% MonClient::_send_mon_message
>>      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_refused
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_command
>>      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_accept
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_connect
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_stop
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::accept
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::handle_connect_msg
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::mark_down
>>      0.0   0.0% 100.0%      0.0   0.0%
>> AsyncConnection::prepare_send_message
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_bulk
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_until
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::send_keepalive
>>      0.0   0.0% 100.0%      3.3   0.0% AsyncConnection::wakeup_from
>>      0.0   0.0% 100.0%      1.8   0.0% AsyncMessenger::get_connection
>>      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::reap_dead
>>      0.0   0.0% 100.0%      2.5   0.0% C_OnMapCommit::finish
>>      0.0   0.0% 100.0%      0.0   0.0%
>> CephXTicketHandler::verify_service_ticket_reply
>>      0.0   0.0% 100.0%      0.0   0.0%
>> CephXTicketManager::verify_service_ticket_reply
>>      0.0   0.0% 100.0%      0.0   0.0%
>> CephxAuthorizeHandler::verify_authorizer
>>      0.0   0.0% 100.0%      0.0   0.0% CephxClientHandler::handle_response
>>      0.0   0.0% 100.0%     40.9   0.2% Context::complete
>>      0.0   0.0% 100.0%      4.8   0.0% CrushWrapper::encode
>>      0.0   0.0% 100.0%      0.0   0.0% CryptoAESKeyHandler::decrypt
>>      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::decode
>>      0.0   0.0% 100.0%    160.4   0.6%
>> DispatchQueue::DispatchThread::entry
>>      0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::entry
>>      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::fast_dispatch
>>      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::pre_dispatch
>>      0.0   0.0% 100.0%      0.0   0.0% EntityName::set
>>      0.0   0.0% 100.0%      0.0   0.0% EpollDriver::event_wait
>>      0.0   0.0% 100.0%      3.0   0.0%
>> EventCenter::dispatch_event_external
>>      0.0   0.0% 100.0%      3.3   0.0% EventCenter::process_time_events
>>      0.0   0.0% 100.0%      3.0   0.0% EventCenter::wakeup
>>      0.0   0.0% 100.0%      0.0   0.0% FileJournal::prepare_entry
>>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_op
>>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transaction
>>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transactions
>>      0.0   0.0% 100.0%      0.0   0.0% FileStore::_journaled_ahead
>>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_write
>>      0.0   0.0% 100.0%      0.0   0.0% FileStore::queue_transactions
>>      0.0   0.0% 100.0%      2.6   0.0% Finisher::finisher_thread_entry
>>      0.0   0.0% 100.0%      0.1   0.0% FunctionContext::finish
>>      0.0   0.0% 100.0%      0.1   0.0% HitSet::Params::decode
>>      0.0   0.0% 100.0%      0.2   0.0% LogChannel::do_log@a90a00
>>      0.0   0.0% 100.0%      0.3   0.0% LogChannel::do_log@a91030
>>      0.0   0.0% 100.0%      0.2   0.0% LogClient::get_mon_log_message
>>      0.0   0.0% 100.0%      0.0   0.0% LogClient::handle_log_ack
>>      0.0   0.0% 100.0%      0.1   0.0% LogClient::queue
>>      0.0   0.0% 100.0%      0.3   0.0% LogClientTemp::~LogClientTemp
>>      0.0   0.0% 100.0%      0.0   0.0% MAuthReply::decode_payload
>>      0.0   0.0% 100.0%      0.0   0.0% MCommand::decode_payload
>>      0.0   0.0% 100.0%      0.0   0.0% MCommand::print
>>      0.0   0.0% 100.0%      0.0   0.0% MMgrMap::decode_payload
>>      0.0   0.0% 100.0%      0.0   0.0% MOSDFailure::print
>>      0.0   0.0% 100.0%      0.1   0.0% MOSDMap::decode_payload
>>      0.0   0.0% 100.0%    203.1   0.8% MOSDPGNotify::decode_payload
>>      0.0   0.0% 100.0%      0.0   0.0% MOSDPGNotify::print
>>      0.0   0.0% 100.0%      0.0   0.0% MOSDPing::encode_payload
>>      0.0   0.0% 100.0%      0.0   0.0% Message::encode
>>      0.0   0.0% 100.0%      0.0   0.0% MgrClient::handle_mgr_map
>>      0.0   0.0% 100.0%      0.0   0.0% MgrClient::ms_dispatch
>>      0.0   0.0% 100.0%      0.0   0.0% MgrMap::decode
>>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_rotating
>>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_tickets
>>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_finish_hunting
>>      0.0   0.0% 100.0%      0.8   0.0% MonClient::_reopen_session@aeab80
>>      0.0   0.0% 100.0%      0.6   0.0% MonClient::_reopen_session@af2ba0
>>      0.0   0.0% 100.0%      9.5   0.0% MonClient::handle_auth
>>      0.0   0.0% 100.0%      9.6   0.0% MonClient::ms_dispatch
>>      0.0   0.0% 100.0%      0.2   0.0% MonClient::send_log
>>      0.0   0.0% 100.0%      0.6   0.0% MonClient::tick
>>      0.0   0.0% 100.0%    283.5   1.0% NetworkStack::get_worker
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::CommandWQ::_process
>>      0.0   0.0% 100.0%  25862.5  95.7% OSD::PeeringWQ::_process
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::Session::Session
>>      0.0   0.0% 100.0%    686.3   2.5% OSD::T_Heartbeat::entry
>>      0.0   0.0% 100.0%      2.5   0.0% OSD::_committed_osd_maps
>>      0.0   0.0% 100.0%  25804.6  95.5% OSD::advance_pg
>>      0.0   0.0% 100.0%      0.3   0.0% OSD::check_ops_in_flight
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::check_osdmap_features
>>      0.0   0.0% 100.0%      2.5   0.0% OSD::consume_map
>>      0.0   0.0% 100.0%     57.8   0.2% OSD::dispatch_context
>>      0.0   0.0% 100.0%      0.5   0.0% OSD::dispatch_op
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::do_command
>>      0.0   0.0% 100.0%      0.2   0.0% OSD::do_waiters
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::get_osdmap_pobject_name
>>      0.0   0.0% 100.0%      0.1   0.0% OSD::handle_osd_ping
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_pg_peering_evt
>>      0.0   0.0% 100.0%     37.2   0.1% OSD::heartbeat_check
>>      0.0   0.0% 100.0%      0.1   0.0% OSD::heartbeat_dispatch
>>      0.0   0.0% 100.0%    686.3   2.5% OSD::heartbeat_entry
>>      0.0   0.0% 100.0%      1.1   0.0% OSD::heartbeat_reset
>>      0.0   0.0% 100.0%    148.7   0.6% OSD::ms_dispatch
>>      0.0   0.0% 100.0%      0.8   0.0% OSD::ms_handle_connect
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_refused
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_reset
>>      0.0   0.0% 100.0%  25862.5  95.7% OSD::process_peering_events
>>      0.0   0.0% 100.0%      0.1   0.0% OSD::require_same_or_newer_map
>>      0.0   0.0% 100.0%      0.0   0.0% OSD::write_superblock
>>      0.0   0.0% 100.0%      0.0   0.0% OSDCap::parse
>>      0.0   0.0% 100.0%      0.0   0.0% OSDMap::Incremental::decode
>>      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b85440
>>      0.0   0.0% 100.0%    110.8   0.4% OSDMap::encode
>>      0.0   0.0% 100.0%      0.5   0.0% OSDMap::post_decode
>>      0.0   0.0% 100.0%      0.1   0.0% OSDService::_get_map_bl
>>      0.0   0.0% 100.0%      0.0   0.0% OSDService::check_nearfull_warning
>>      0.0   0.0% 100.0%      0.1   0.0% OSDService::clear_map_bl_cache_pins
>>      0.0   0.0% 100.0%      1.1   0.0% OSDService::get_con_osd_hb
>>      0.0   0.0% 100.0%      1.3   0.0% OSDService::get_inc_map_bl
>>      0.0   0.0% 100.0%      1.3   0.0% OSDService::pin_map_bl
>>      0.0   0.0% 100.0%      0.0   0.0% OSDService::pin_map_inc_bl
>>      0.0   0.0% 100.0%      0.0   0.0% OSDService::publish_superblock
>>      0.0   0.0% 100.0%      0.3   0.0% OSDService::queue_for_peering
>>      0.0   0.0% 100.0%     27.2   0.1% OSDService::send_incremental_map
>>      0.0   0.0% 100.0%     27.2   0.1% OSDService::share_map_peer
>>      0.0   0.0% 100.0%      0.0   0.0% OSDService::update_osd_stat
>>      0.0   0.0% 100.0%      0.0   0.0%
>> ObjectStore::Transaction::_get_coll_id
>>      0.0   0.0% 100.0%      0.0   0.0%
>> ObjectStore::Transaction::_get_next_op
>>      0.0   0.0% 100.0%      0.2   0.0% ObjectStore::Transaction::write
>>      0.0   0.0% 100.0%      0.0   0.0% ObjectStore::queue_transaction
>>      0.0   0.0% 100.0%      0.0   0.0% Objecter::_maybe_request_map
>>      0.0   0.0% 100.0%      0.1   0.0% Objecter::handle_osd_map
>>      0.0   0.0% 100.0%      0.1   0.0% OpHistory::insert
>>      0.0   0.0% 100.0%      0.0   0.0% OpRequest::OpRequest
>>      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_flag_point
>>      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_started
>>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::RemoveOnDelete::operator
>>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::_mark_event
>>      0.0   0.0% 100.0%      0.0   0.0% OpTracker::get_age_ms_histogram
>>      0.0   0.0% 100.0%  25802.4  95.4% PG::RecoveryState::Stray::react
>>      0.0   0.0% 100.0%      0.0   0.0% PG::_prepare_write_info
>>      0.0   0.0% 100.0%  25802.4  95.4% PG::handle_activate_map
>>      0.0   0.0% 100.0%      1.6   0.0% PG::handle_advance_map
>>      0.0   0.0% 100.0%      0.0   0.0% PG::prepare_write_info
>>      0.0   0.0% 100.0%      0.0   0.0% PG::write_if_dirty
>>      0.0   0.0% 100.0%      1.6   0.0% PGPool::update
>>      0.0   0.0% 100.0%      0.0   0.0% PK11_FreeSymKey
>>      0.0   0.0% 100.0%      0.0   0.0% PK11_GetIVLength
>>      0.0   0.0% 100.0%      0.0   0.0% PK11_ImportSymKey
>>      0.0   0.0% 100.0%      0.0   0.0% PrebufferedStreambuf::overflow
>>      0.0   0.0% 100.0%      1.8   0.0% Processor::accept
>>      0.0   0.0% 100.0%      0.0   0.0% SECITEM_CopyItem_Util
>>      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_after
>>      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_at
>>      0.0   0.0% 100.0%     38.4   0.1% SafeTimer::timer_thread
>>      0.0   0.0% 100.0%     38.4   0.1% SafeTimerThread::entry
>>      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::WorkThread::entry
>>      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::worker
>>      0.0   0.0% 100.0%  27023.8 100.0% __clone
>>      0.0   0.0% 100.0%      0.0   0.0%
>> boost::detail::function::void_function_obj_invoker2::invoke
>>      0.0   0.0% 100.0%      0.0   0.0%
>> boost::proto::detail::default_assign::impl::operator
>>      0.0   0.0% 100.0%      0.0   0.0%
>> boost::spirit::classic::impl::concrete_parser::do_parse_virtual
>>      0.0   0.0% 100.0%      0.0   0.0% boost::spirit::qi::action::parse
>>      0.0   0.0% 100.0%      0.3   0.0%
>> boost::statechart::event_base::intrusive_from_this
>>      0.0   0.0% 100.0%  25802.4  95.4%
>> boost::statechart::simple_state::react_impl
>>      0.0   0.0% 100.0%  25802.4  95.4%
>> boost::statechart::state_machine::send_event
>>      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@48b620
>>      0.0   0.0% 100.0%      0.4   0.0%
>> ceph::buffer::list::contiguous_appender::contiguous_appender
>>      0.0   0.0% 100.0%      2.4   0.0% ceph::buffer::list::crc32c
>>      0.0   0.0% 100.0%      0.1   0.0%
>> ceph::buffer::list::iterator_impl::copy
>>      0.0   0.0% 100.0%      0.0   0.0%
>> ceph::buffer::list::iterator_impl::copy_deep
>>      0.0   0.0% 100.0%      5.7   0.0%
>> ceph::buffer::list::iterator_impl::copy_shallow
>>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::ptr::ptr
>>      0.0   0.0% 100.0%      0.0   0.0% ceph_heap_profiler_handle_command
>>      0.0   0.0% 100.0%      0.0   0.0% ceph_os_fremovexattr
>>      0.0   0.0% 100.0%      0.0   0.0% cephx_verify_authorizer
>>      0.0   0.0% 100.0%      0.0   0.0% cmdmap_from_json
>>      0.0   0.0% 100.0%      2.2   0.0% crush_hash_name
>>      0.0   0.0% 100.0%      0.1   0.0% decode
>>      0.0   0.0% 100.0%     20.1   0.1% entity_addr_t::encode
>>      0.0   0.0% 100.0%      0.0   0.0% get_str_vec
>>      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15110
>>      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15b90
>>      0.0   0.0% 100.0%      0.0   0.0%
>> json_spirit::Semantic_actions::new_name
>>      0.0   0.0% 100.0%      0.0   0.0%
>> json_spirit::Semantic_actions::new_str
>>      0.0   0.0% 100.0%      1.1   0.0% json_spirit::Value_impl::get_uint64
>>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str
>>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str_
>>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read
>>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range
>>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range_or_throw
>>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::substitute_esc_chars
>>      0.0   0.0% 100.0%      0.0   0.0% operator<<@a91e90
>>      0.0   0.0% 100.0%      3.5   0.0% osd_info_t::encode
>>      0.0   0.0% 100.0%      4.4   0.0% osd_xinfo_t::encode
>>      0.0   0.0% 100.0%      0.1   0.0% pg_info_t::decode
>>      0.0   0.0% 100.0%      0.0   0.0% pg_info_t::operator=
>>      0.0   0.0% 100.0%      9.9   0.0% pg_info_t::pg_info_t
>>      0.0   0.0% 100.0%     87.5   0.3% pg_interval_t::decode
>>      0.0   0.0% 100.0%      1.0   0.0% pg_pool_t::decode
>>      0.0   0.0% 100.0%      1.8   0.0% pg_pool_t::encode
>>      0.0   0.0% 100.0%      0.0   0.0% pg_stat_t::decode
>>      0.0   0.0% 100.0%  27032.1 100.0% start_thread
>>      0.0   0.0% 100.0%      1.3   0.0% std::_Rb_tree::operator=
>>      0.0   0.0% 100.0%      0.1   0.0% std::_Sp_counted_base::_M_release
>>      0.0   0.0% 100.0%      0.0   0.0%
>> std::__detail::_Map_base::operator[]
>>      0.0   0.0% 100.0%      0.0   0.0% std::__ostream_insert
>>      0.0   0.0% 100.0%      0.1   0.0% std::basic_streambuf::xsputn
>>      0.0   0.0% 100.0%      0.1   0.0% std::basic_string::basic_string
>>      0.0   0.0% 100.0%      0.0   0.0% std::basic_stringbuf::overflow
>>      0.0   0.0% 100.0%      1.0   0.0% std::basic_stringbuf::str
>>      0.0   0.0% 100.0%     71.0   0.3% std::enable_if::type encode
>>      0.0   0.0% 100.0%      0.1   0.0% std::getline
>>      0.0   0.0% 100.0%      0.0   0.0% std::num_put::_M_insert_int
>>      0.0   0.0% 100.0%      0.0   0.0% std::num_put::do_put
>>      0.0   0.0% 100.0%      0.0   0.0% std::operator<<
>>      0.0   0.0% 100.0%      0.0   0.0% std::ostream::_M_insert
>>      0.0   0.0% 100.0%      1.2   0.0% std::string::_Rep::_M_clone
>>      0.0   0.0% 100.0%      1.2   0.0% std::string::_S_construct
>>      0.0   0.0% 100.0%      1.2   0.0% std::string::append
>>      0.0   0.0% 100.0%      1.2   0.0% std::string::reserve
>>      0.0   0.0% 100.0%    283.5   1.0% std::this_thread::__sleep_for
>>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c12db0
>>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c14a80
>>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c15450
>>      0.0   0.0% 100.0%     20.1   0.1% void encode
>>
>>
>> I also generated the PDF with all the charts, but not sure how to share it
>> with you guys.
>> any idea what is happening here?
>>
>> thanks
>> ali
>>
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
       [not found]         ` <93debf2d-12cb-eceb-e9cd-5226ad49cc16@gmail.com>
@ 2017-08-21 15:18           ` Haomai Wang
  2017-08-21 16:05             ` Mustafa Muhammad
  2017-08-22  8:37             ` Linux Chips
  0 siblings, 2 replies; 27+ messages in thread
From: Haomai Wang @ 2017-08-21 15:18 UTC (permalink / raw)
  To: Linux Chips; +Cc: ceph-devel

I have not dug into your previous mail too deeply, but did you try ms
type = simple to see if it helps? I don't see anything obvious pointing
at the messenger, though.
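
For reference, switching back to the simple messenger should just be a
ceph.conf change on the OSD nodes followed by an OSD restart, roughly
(placement under [global] is my assumption, not something verified here):

  [global]
  ms type = simple

That would only be an experiment to rule the async messenger in or out,
not a fix.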

On Mon, Aug 21, 2017 at 11:14 PM, Linux Chips <linux.chips@gmail.com> wrote:
> we tried it, same thing. we lowered it to "2" along with other osd_map*
> values; a little better, but we still get a lot of oom kills and are unable
> to fully start the cluster.
> we need to squeeze more memory out of those OSDs.
>
>
> On 08/21/2017 04:07 PM, Haomai Wang wrote:
>
> did you try lowering osd_map_cache_size to 20?
>
> On Mon, Aug 21, 2017 at 6:57 PM, Linux Chips <linux.chips@gmail.com> wrote:
>
> Hi,
> I have an idea to move pools out of the "current" directory (e.g. move them
> into a directory "current.bak"), and keep only one pool at a time in there
> so the OSD would load fewer PGs.
> has anyone tried this before? would we risk data loss?
>
>
>
> On 08/17/2017 09:51 PM, Linux Chips wrote:
>
>
> On 08/17/2017 08:53 PM, Gregory Farnum wrote:
>
> On Thu, Aug 17, 2017 at 7:13 AM, Linux Chips <linux.chips@gmail.com>
> wrote:
>
> Hello everybody,
> I have Kraken cluster with 660 OSD, currently it is down due to not
> being able to complete peering, OSDs start consuming lots of memory
> draining the system and killing the node, so I set a limit on the OSD
> service (on some OSDs 28G and others as high as 35G), so they get
> killed before taking down the whole node.
> Now I still can't peer, one OSD entering the cluster (with about 300
> already up) makes memory usage of most other OSDs so high (15G+, some as
> much as 30G) and
> sometimes kills them when they reach the service limit. which cause a
> spiral
> load and causing all the OSDs to consume all the available.
>
> I found this thread with similar symptoms:
>
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html
>
> with a request for stack trace, I have a 14G core dump, we generated it
> by
> running the osd from the terminal, enabling the core dumps, and setting
> ulimits to 15G. what kind of a trace would be useful? all thread?! any
> better way to debug this?
>
> What can I do do make it work, is this memory allocation normal?
>
> some info about the cluster:
> 41 hdd nodes with 12 x 4TB osd each, 5 of the nodes have 8TB disks. 324
> GB
> RAM and dula socket intel xeon.
> 7 nodes with 400GB x 24 ssd and 256GB RAM, and dual socket cpu.
> 3 monitors
>
> all dual 10GB ethernet, except for the monitor with dual 1GB ethers.
>
> all nodes running centos 7.2
> it is an old cluster that was upgraded continuously for the past 3
> years.
> the cluster was on jewel when the issue happened due to some accidental
> OSD
> map changes, causing a heavy recovery operations on the cluster. then we
> upgraded to kraken in the hope of less memory foot prints.
>
> any advice on how to proceed?
>
> It's not normal but if something really bad happened to your cluster,
> it's been known to occur. You should go through the troubleshooting
> guides at docs.ceph.com, but the general strategy is to set
> nodown/noout/etc flags, undo whatever horrible thing you tried to make
> the map do, and then turn all the OSDs back on.
> -Greg
>
> Hi,
> we have been trying this for the past week, it keeps consuming the RAM.
> we got the map back to the original places. marked all the flags, started
> all the OSDs. then "ceph osd unset noup", wait 5 min, and all OSDs are
> killed by the oom.
> we tried one node at a time, let it finish recovering, and start the next.
> we got to a point where, when we started the next node, everything got killed.
> we tried one OSD at a time, same result. one OSD up, ~40 killed by oom,
> then it is a snowball from here until all of the active OSDs get killed.
>
> I think all this up/down that we generated has increased the recovery too
> much. btw, we stopped all clients. and also we have some not so friendly
> erasure pools. some OSDs now report loading as much as 800 pg, while we
> originally had about 300-400 (I know too much, but we were trying to fix it
> and.... well we could not).
>
> we did a memory profiling on one of the OSDs.
> here is the results
>
>
>  12878.6  47.6%  47.6%  12878.6  47.6% std::_Rb_tree::_M_create_node
>  12867.6  47.6%  95.2%  25746.2  95.2% std::_Rb_tree::_M_copy
>    532.4   2.0%  97.2%    686.3   2.5% OSD::heartbeat
>    122.8   0.5%  97.7%    122.8   0.5%
> std::_Rb_tree::_M_emplace_hint_unique
>    121.9   0.5%  98.1%    171.1   0.6% AsyncConnection::send_message
>    104.2   0.4%  98.5%    104.2   0.4% ceph::buffer::list::append@c4a770
>     99.7   0.4%  98.9%     99.7   0.4% std::vector::_M_default_append
>     99.6   0.4%  99.2%     99.6   0.4% ceph::logging::Log::create_entry
>     72.6   0.3%  99.5%     72.6   0.3% ceph::buffer::create_aligned
>     52.4   0.2%  99.7%     52.5   0.2% std::vector::_M_emplace_back_aux
>     23.9   0.1%  99.8%     57.8   0.2% OSD::do_notifies
>     17.0   0.1%  99.8%     23.1   0.1%
> OSDService::build_incremental_map_msg
>      9.8   0.0%  99.9%    222.5   0.8% std::enable_if::type decode
>      6.2   0.0%  99.9%      6.3   0.0% std::map::operator[]
>      5.5   0.0%  99.9%      5.5   0.0% std::vector::vector
>      3.5   0.0%  99.9%      3.5   0.0% EventCenter::create_time_event
>      2.5   0.0%  99.9%      2.5   0.0% AsyncConnection::AsyncConnection
>      2.4   0.0% 100.0%      2.4   0.0% std::string::_Rep::_S_create
>      1.5   0.0% 100.0%      1.5   0.0% std::_Rb_tree::_M_insert_unique
>      1.4   0.0% 100.0%      1.4   0.0% std::list::operator=
>      1.3   0.0% 100.0%      1.3   0.0% ceph::buffer::list::list
>      0.9   0.0% 100.0%    204.1   0.8% decode_message
>      0.8   0.0% 100.0%      0.8   0.0% OSD::send_failures
>      0.7   0.0% 100.0%      0.9   0.0% void decode
>      0.6   0.0% 100.0%      0.6   0.0% std::_Rb_tree::_M_insert_equal
>      0.6   0.0% 100.0%      2.5   0.0% PG::queue_null
>      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::create_connect
>      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::add_accept
>      0.5   0.0% 100.0%      0.5   0.0% boost::statechart::event::clone
>      0.4   0.0% 100.0%      0.4   0.0% PG::queue_peering_event
>      0.3   0.0% 100.0%      0.3   0.0% OSD::PeeringWQ::_enqueue
>      0.3   0.0% 100.0%    148.6   0.5% OSD::_dispatch
>      0.1   0.0% 100.0%    147.9   0.5% OSD::handle_osd_map
>      0.1   0.0% 100.0%      0.1   0.0% std::deque::_M_push_back_aux
>      0.1   0.0% 100.0%      0.2   0.0% SharedLRU::add
>      0.1   0.0% 100.0%      0.1   0.0% OSD::PeeringWQ::_dequeue
>      0.1   0.0% 100.0%      0.1   0.0% ceph::buffer::list::append@c4a9b0
>      0.1   0.0% 100.0%      0.2   0.0% DispatchQueue::enqueue
>      0.1   0.0% 100.0%    283.5   1.0% EventCenter::process_events
>      0.1   0.0% 100.0%      0.1   0.0% HitSet::Params::create_impl
>      0.1   0.0% 100.0%      0.1   0.0% SimpleLRU::clear_pinned
>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_
>      0.0   0.0% 100.0%      0.2   0.0% TrackedOp::mark_event
>      0.0   0.0% 100.0%      0.0   0.0% OSD::create_context
>      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_allocate_node
>      0.0   0.0% 100.0%      0.0   0.0% OSDMap::OSDMap
>      0.0   0.0% 100.0%    281.6   1.0% AsyncConnection::process
>      0.0   0.0% 100.0%  25802.4  95.4%
> PG::RecoveryState::RecoveryMachine::send_notify
>      0.0   0.0% 100.0%      0.0   0.0% SharedLRU::lru_add
>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_unique_
>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::unregister_inflight_op
>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_verify_authorizer
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::_add_map
>      0.0   0.0% 100.0%      0.1   0.0% OSD::wait_for_new_map
>      0.0   0.0% 100.0%      0.5   0.0% OSD::handle_pg_notify
>      0.0   0.0% 100.0%      0.0   0.0% std::__shared_count::__shared_count
>      0.0   0.0% 100.0%      0.0   0.0% std::__shared_ptr::reset
>      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b84080
>      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_emplace_unique
>      0.0   0.0% 100.0%      0.0   0.0% std::vector::operator=
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_renew_subs
>      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_emplace
>      0.0   0.0% 100.0%      0.0   0.0% PORT_Alloc_Util
>      0.0   0.0% 100.0%      0.0   0.0% CryptoAES::get_key_handler
>      0.0   0.0% 100.0%      0.0   0.0% get_auth_session_handler
>      0.0   0.0% 100.0%      0.0   0.0% PosixWorker::connect
>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a440
>      0.0   0.0% 100.0%      0.0   0.0% std::vector::_M_fill_insert
>      0.0   0.0% 100.0%      4.8   0.0% AsyncConnection::fault
>      0.0   0.0% 100.0%      0.0   0.0% OSD::send_pg_stats
>      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::accept_conn
>      0.0   0.0% 100.0%      0.0   0.0% PosixServerSocketImpl::accept
>      0.0   0.0% 100.0%      9.3   0.0%
> AsyncConnection::_process_connection
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::lfn_open
>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a350
>      0.0   0.0% 100.0%      0.0   0.0% crush_create
>      0.0   0.0% 100.0%      0.1   0.0% MgrClient::send_report
>      0.0   0.0% 100.0%      0.0   0.0% WBThrottle::queue_wb
>      0.0   0.0% 100.0%      0.2   0.0% LogClient::_get_mon_log_message
>      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::_set_secret
>      0.0   0.0% 100.0%      0.0   0.0% std::_Deque_base::_M_initialize_map
>      0.0   0.0% 100.0%      0.1   0.0%
> ThreadPool::BatchWorkQueue::_void_dequeue
>      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@ba6a50
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::schedule_tick
>      0.0   0.0% 100.0%      0.1   0.0% OSD::tick
>      0.0   0.0% 100.0%     37.6   0.1% OSD::tick_without_osd_lock
>      0.0   0.0% 100.0%      0.0   0.0%
> boost::spirit::classic::impl::get_definition
>      0.0   0.0% 100.0%      9.4   0.0% MonClient::_send_mon_message
>      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_refused
>      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_command
>      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_accept
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_connect
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_stop
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::accept
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::handle_connect_msg
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::mark_down
>      0.0   0.0% 100.0%      0.0   0.0%
> AsyncConnection::prepare_send_message
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_bulk
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_until
>      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::send_keepalive
>      0.0   0.0% 100.0%      3.3   0.0% AsyncConnection::wakeup_from
>      0.0   0.0% 100.0%      1.8   0.0% AsyncMessenger::get_connection
>      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::reap_dead
>      0.0   0.0% 100.0%      2.5   0.0% C_OnMapCommit::finish
>      0.0   0.0% 100.0%      0.0   0.0%
> CephXTicketHandler::verify_service_ticket_reply
>      0.0   0.0% 100.0%      0.0   0.0%
> CephXTicketManager::verify_service_ticket_reply
>      0.0   0.0% 100.0%      0.0   0.0%
> CephxAuthorizeHandler::verify_authorizer
>      0.0   0.0% 100.0%      0.0   0.0% CephxClientHandler::handle_response
>      0.0   0.0% 100.0%     40.9   0.2% Context::complete
>      0.0   0.0% 100.0%      4.8   0.0% CrushWrapper::encode
>      0.0   0.0% 100.0%      0.0   0.0% CryptoAESKeyHandler::decrypt
>      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::decode
>      0.0   0.0% 100.0%    160.4   0.6%
> DispatchQueue::DispatchThread::entry
>      0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::entry
>      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::fast_dispatch
>      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::pre_dispatch
>      0.0   0.0% 100.0%      0.0   0.0% EntityName::set
>      0.0   0.0% 100.0%      0.0   0.0% EpollDriver::event_wait
>      0.0   0.0% 100.0%      3.0   0.0%
> EventCenter::dispatch_event_external
>      0.0   0.0% 100.0%      3.3   0.0% EventCenter::process_time_events
>      0.0   0.0% 100.0%      3.0   0.0% EventCenter::wakeup
>      0.0   0.0% 100.0%      0.0   0.0% FileJournal::prepare_entry
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_op
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transaction
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transactions
>      0.0   0.0% 100.0%      0.0   0.0% FileStore::_journaled_ahead
>      0.0   0.0% 100.0%      0.2   0.0% FileStore::_write
>      0.0   0.0% 100.0%      0.0   0.0% FileStore::queue_transactions
>      0.0   0.0% 100.0%      2.6   0.0% Finisher::finisher_thread_entry
>      0.0   0.0% 100.0%      0.1   0.0% FunctionContext::finish
>      0.0   0.0% 100.0%      0.1   0.0% HitSet::Params::decode
>      0.0   0.0% 100.0%      0.2   0.0% LogChannel::do_log@a90a00
>      0.0   0.0% 100.0%      0.3   0.0% LogChannel::do_log@a91030
>      0.0   0.0% 100.0%      0.2   0.0% LogClient::get_mon_log_message
>      0.0   0.0% 100.0%      0.0   0.0% LogClient::handle_log_ack
>      0.0   0.0% 100.0%      0.1   0.0% LogClient::queue
>      0.0   0.0% 100.0%      0.3   0.0% LogClientTemp::~LogClientTemp
>      0.0   0.0% 100.0%      0.0   0.0% MAuthReply::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MCommand::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MCommand::print
>      0.0   0.0% 100.0%      0.0   0.0% MMgrMap::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MOSDFailure::print
>      0.0   0.0% 100.0%      0.1   0.0% MOSDMap::decode_payload
>      0.0   0.0% 100.0%    203.1   0.8% MOSDPGNotify::decode_payload
>      0.0   0.0% 100.0%      0.0   0.0% MOSDPGNotify::print
>      0.0   0.0% 100.0%      0.0   0.0% MOSDPing::encode_payload
>      0.0   0.0% 100.0%      0.0   0.0% Message::encode
>      0.0   0.0% 100.0%      0.0   0.0% MgrClient::handle_mgr_map
>      0.0   0.0% 100.0%      0.0   0.0% MgrClient::ms_dispatch
>      0.0   0.0% 100.0%      0.0   0.0% MgrMap::decode
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_rotating
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_tickets
>      0.0   0.0% 100.0%      0.0   0.0% MonClient::_finish_hunting
>      0.0   0.0% 100.0%      0.8   0.0% MonClient::_reopen_session@aeab80
>      0.0   0.0% 100.0%      0.6   0.0% MonClient::_reopen_session@af2ba0
>      0.0   0.0% 100.0%      9.5   0.0% MonClient::handle_auth
>      0.0   0.0% 100.0%      9.6   0.0% MonClient::ms_dispatch
>      0.0   0.0% 100.0%      0.2   0.0% MonClient::send_log
>      0.0   0.0% 100.0%      0.6   0.0% MonClient::tick
>      0.0   0.0% 100.0%    283.5   1.0% NetworkStack::get_worker
>      0.0   0.0% 100.0%      0.0   0.0% OSD::CommandWQ::_process
>      0.0   0.0% 100.0%  25862.5  95.7% OSD::PeeringWQ::_process
>      0.0   0.0% 100.0%      0.0   0.0% OSD::Session::Session
>      0.0   0.0% 100.0%    686.3   2.5% OSD::T_Heartbeat::entry
>      0.0   0.0% 100.0%      2.5   0.0% OSD::_committed_osd_maps
>      0.0   0.0% 100.0%  25804.6  95.5% OSD::advance_pg
>      0.0   0.0% 100.0%      0.3   0.0% OSD::check_ops_in_flight
>      0.0   0.0% 100.0%      0.0   0.0% OSD::check_osdmap_features
>      0.0   0.0% 100.0%      2.5   0.0% OSD::consume_map
>      0.0   0.0% 100.0%     57.8   0.2% OSD::dispatch_context
>      0.0   0.0% 100.0%      0.5   0.0% OSD::dispatch_op
>      0.0   0.0% 100.0%      0.0   0.0% OSD::do_command
>      0.0   0.0% 100.0%      0.2   0.0% OSD::do_waiters
>      0.0   0.0% 100.0%      0.0   0.0% OSD::get_osdmap_pobject_name
>      0.0   0.0% 100.0%      0.1   0.0% OSD::handle_osd_ping
>      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_pg_peering_evt
>      0.0   0.0% 100.0%     37.2   0.1% OSD::heartbeat_check
>      0.0   0.0% 100.0%      0.1   0.0% OSD::heartbeat_dispatch
>      0.0   0.0% 100.0%    686.3   2.5% OSD::heartbeat_entry
>      0.0   0.0% 100.0%      1.1   0.0% OSD::heartbeat_reset
>      0.0   0.0% 100.0%    148.7   0.6% OSD::ms_dispatch
>      0.0   0.0% 100.0%      0.8   0.0% OSD::ms_handle_connect
>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_refused
>      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_reset
>      0.0   0.0% 100.0%  25862.5  95.7% OSD::process_peering_events
>      0.0   0.0% 100.0%      0.1   0.0% OSD::require_same_or_newer_map
>      0.0   0.0% 100.0%      0.0   0.0% OSD::write_superblock
>      0.0   0.0% 100.0%      0.0   0.0% OSDCap::parse
>      0.0   0.0% 100.0%      0.0   0.0% OSDMap::Incremental::decode
>      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b85440
>      0.0   0.0% 100.0%    110.8   0.4% OSDMap::encode
>      0.0   0.0% 100.0%      0.5   0.0% OSDMap::post_decode
>      0.0   0.0% 100.0%      0.1   0.0% OSDService::_get_map_bl
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::check_nearfull_warning
>      0.0   0.0% 100.0%      0.1   0.0% OSDService::clear_map_bl_cache_pins
>      0.0   0.0% 100.0%      1.1   0.0% OSDService::get_con_osd_hb
>      0.0   0.0% 100.0%      1.3   0.0% OSDService::get_inc_map_bl
>      0.0   0.0% 100.0%      1.3   0.0% OSDService::pin_map_bl
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::pin_map_inc_bl
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::publish_superblock
>      0.0   0.0% 100.0%      0.3   0.0% OSDService::queue_for_peering
>      0.0   0.0% 100.0%     27.2   0.1% OSDService::send_incremental_map
>      0.0   0.0% 100.0%     27.2   0.1% OSDService::share_map_peer
>      0.0   0.0% 100.0%      0.0   0.0% OSDService::update_osd_stat
>      0.0   0.0% 100.0%      0.0   0.0%
> ObjectStore::Transaction::_get_coll_id
>      0.0   0.0% 100.0%      0.0   0.0%
> ObjectStore::Transaction::_get_next_op
>      0.0   0.0% 100.0%      0.2   0.0% ObjectStore::Transaction::write
>      0.0   0.0% 100.0%      0.0   0.0% ObjectStore::queue_transaction
>      0.0   0.0% 100.0%      0.0   0.0% Objecter::_maybe_request_map
>      0.0   0.0% 100.0%      0.1   0.0% Objecter::handle_osd_map
>      0.0   0.0% 100.0%      0.1   0.0% OpHistory::insert
>      0.0   0.0% 100.0%      0.0   0.0% OpRequest::OpRequest
>      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_flag_point
>      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_started
>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::RemoveOnDelete::operator
>      0.0   0.0% 100.0%      0.1   0.0% OpTracker::_mark_event
>      0.0   0.0% 100.0%      0.0   0.0% OpTracker::get_age_ms_histogram
>      0.0   0.0% 100.0%  25802.4  95.4% PG::RecoveryState::Stray::react
>      0.0   0.0% 100.0%      0.0   0.0% PG::_prepare_write_info
>      0.0   0.0% 100.0%  25802.4  95.4% PG::handle_activate_map
>      0.0   0.0% 100.0%      1.6   0.0% PG::handle_advance_map
>      0.0   0.0% 100.0%      0.0   0.0% PG::prepare_write_info
>      0.0   0.0% 100.0%      0.0   0.0% PG::write_if_dirty
>      0.0   0.0% 100.0%      1.6   0.0% PGPool::update
>      0.0   0.0% 100.0%      0.0   0.0% PK11_FreeSymKey
>      0.0   0.0% 100.0%      0.0   0.0% PK11_GetIVLength
>      0.0   0.0% 100.0%      0.0   0.0% PK11_ImportSymKey
>      0.0   0.0% 100.0%      0.0   0.0% PrebufferedStreambuf::overflow
>      0.0   0.0% 100.0%      1.8   0.0% Processor::accept
>      0.0   0.0% 100.0%      0.0   0.0% SECITEM_CopyItem_Util
>      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_after
>      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_at
>      0.0   0.0% 100.0%     38.4   0.1% SafeTimer::timer_thread
>      0.0   0.0% 100.0%     38.4   0.1% SafeTimerThread::entry
>      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::WorkThread::entry
>      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::worker
>      0.0   0.0% 100.0%  27023.8 100.0% __clone
>      0.0   0.0% 100.0%      0.0   0.0%
> boost::detail::function::void_function_obj_invoker2::invoke
>      0.0   0.0% 100.0%      0.0   0.0%
> boost::proto::detail::default_assign::impl::operator
>      0.0   0.0% 100.0%      0.0   0.0%
> boost::spirit::classic::impl::concrete_parser::do_parse_virtual
>      0.0   0.0% 100.0%      0.0   0.0% boost::spirit::qi::action::parse
>      0.0   0.0% 100.0%      0.3   0.0%
> boost::statechart::event_base::intrusive_from_this
>      0.0   0.0% 100.0%  25802.4  95.4%
> boost::statechart::simple_state::react_impl
>      0.0   0.0% 100.0%  25802.4  95.4%
> boost::statechart::state_machine::send_event
>      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@48b620
>      0.0   0.0% 100.0%      0.4   0.0%
> ceph::buffer::list::contiguous_appender::contiguous_appender
>      0.0   0.0% 100.0%      2.4   0.0% ceph::buffer::list::crc32c
>      0.0   0.0% 100.0%      0.1   0.0%
> ceph::buffer::list::iterator_impl::copy
>      0.0   0.0% 100.0%      0.0   0.0%
> ceph::buffer::list::iterator_impl::copy_deep
>      0.0   0.0% 100.0%      5.7   0.0%
> ceph::buffer::list::iterator_impl::copy_shallow
>      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::ptr::ptr
>      0.0   0.0% 100.0%      0.0   0.0% ceph_heap_profiler_handle_command
>      0.0   0.0% 100.0%      0.0   0.0% ceph_os_fremovexattr
>      0.0   0.0% 100.0%      0.0   0.0% cephx_verify_authorizer
>      0.0   0.0% 100.0%      0.0   0.0% cmdmap_from_json
>      0.0   0.0% 100.0%      2.2   0.0% crush_hash_name
>      0.0   0.0% 100.0%      0.1   0.0% decode
>      0.0   0.0% 100.0%     20.1   0.1% entity_addr_t::encode
>      0.0   0.0% 100.0%      0.0   0.0% get_str_vec
>      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15110
>      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15b90
>      0.0   0.0% 100.0%      0.0   0.0%
> json_spirit::Semantic_actions::new_name
>      0.0   0.0% 100.0%      0.0   0.0%
> json_spirit::Semantic_actions::new_str
>      0.0   0.0% 100.0%      1.1   0.0% json_spirit::Value_impl::get_uint64
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str_
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range_or_throw
>      0.0   0.0% 100.0%      0.0   0.0% json_spirit::substitute_esc_chars
>      0.0   0.0% 100.0%      0.0   0.0% operator<<@a91e90
>      0.0   0.0% 100.0%      3.5   0.0% osd_info_t::encode
>      0.0   0.0% 100.0%      4.4   0.0% osd_xinfo_t::encode
>      0.0   0.0% 100.0%      0.1   0.0% pg_info_t::decode
>      0.0   0.0% 100.0%      0.0   0.0% pg_info_t::operator=
>      0.0   0.0% 100.0%      9.9   0.0% pg_info_t::pg_info_t
>      0.0   0.0% 100.0%     87.5   0.3% pg_interval_t::decode
>      0.0   0.0% 100.0%      1.0   0.0% pg_pool_t::decode
>      0.0   0.0% 100.0%      1.8   0.0% pg_pool_t::encode
>      0.0   0.0% 100.0%      0.0   0.0% pg_stat_t::decode
>      0.0   0.0% 100.0%  27032.1 100.0% start_thread
>      0.0   0.0% 100.0%      1.3   0.0% std::_Rb_tree::operator=
>      0.0   0.0% 100.0%      0.1   0.0% std::_Sp_counted_base::_M_release
>      0.0   0.0% 100.0%      0.0   0.0%
> std::__detail::_Map_base::operator[]
>      0.0   0.0% 100.0%      0.0   0.0% std::__ostream_insert
>      0.0   0.0% 100.0%      0.1   0.0% std::basic_streambuf::xsputn
>      0.0   0.0% 100.0%      0.1   0.0% std::basic_string::basic_string
>      0.0   0.0% 100.0%      0.0   0.0% std::basic_stringbuf::overflow
>      0.0   0.0% 100.0%      1.0   0.0% std::basic_stringbuf::str
>      0.0   0.0% 100.0%     71.0   0.3% std::enable_if::type encode
>      0.0   0.0% 100.0%      0.1   0.0% std::getline
>      0.0   0.0% 100.0%      0.0   0.0% std::num_put::_M_insert_int
>      0.0   0.0% 100.0%      0.0   0.0% std::num_put::do_put
>      0.0   0.0% 100.0%      0.0   0.0% std::operator<<
>      0.0   0.0% 100.0%      0.0   0.0% std::ostream::_M_insert
>      0.0   0.0% 100.0%      1.2   0.0% std::string::_Rep::_M_clone
>      0.0   0.0% 100.0%      1.2   0.0% std::string::_S_construct
>      0.0   0.0% 100.0%      1.2   0.0% std::string::append
>      0.0   0.0% 100.0%      1.2   0.0% std::string::reserve
>      0.0   0.0% 100.0%    283.5   1.0% std::this_thread::__sleep_for
>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c12db0
>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c14a80
>      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c15450
>      0.0   0.0% 100.0%     20.1   0.1% void encode
>
>
> I also generated the PDF with all the charts, but not sure how to share it
> with you guys.
> any idea what is happening here?
>
> thanks
> ali
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-21 15:18           ` Haomai Wang
@ 2017-08-21 16:05             ` Mustafa Muhammad
  2017-08-22  8:37             ` Linux Chips
  1 sibling, 0 replies; 27+ messages in thread
From: Mustafa Muhammad @ 2017-08-21 16:05 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Linux Chips, ceph-devel

Do you think changing osd_peering_wq_batch_size would help? Should we
increase it to delay peering, or would that backfire?
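
(For what it's worth, osd_peering_wq_batch_size is an ordinary OSD option,
so experimenting with it should only need a ceph.conf change plus an OSD
restart, something like:

  [osd]
  osd peering wq batch size = 5

where 5 is purely an illustrative value; whether raising or lowering it
actually helps in this situation is exactly the open question.)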

Regards
Mustafa

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-21 15:18           ` Haomai Wang
  2017-08-21 16:05             ` Mustafa Muhammad
@ 2017-08-22  8:37             ` Linux Chips
  1 sibling, 0 replies; 27+ messages in thread
From: Linux Chips @ 2017-08-22  8:37 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

On 08/21/2017 06:18 PM, Haomai Wang wrote:
> from your previous mail, not dive into too much. but do you try ms
> type = simple to see? although I don't find any obvious thing to do
> this.
> 
> On Mon, Aug 21, 2017 at 11:14 PM, Linux Chips <linux.chips@gmail.com> wrote:
>> we tried it, same thing. we lowered it to "2" along with other osd_map*
>> values; a little better, but we still get a lot of oom kills and are unable
>> to fully start the cluster.
>> we need to squeeze more memory out of those OSDs.
>>
>>
>> On 08/21/2017 04:07 PM, Haomai Wang wrote:
>>
>> did you try lowering osd_map_cache_size to 20?
>>
>> On Mon, Aug 21, 2017 at 6:57 PM, Linux Chips <linux.chips@gmail.com> wrote:
>>
we did try "ms type = simple"; it did not make much of a difference.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-19 16:38     ` Mustafa Muhammad
@ 2017-08-22 22:33       ` Sage Weil
  2017-08-23 12:21         ` Linux Chips
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2017-08-22 22:33 UTC (permalink / raw)
  To: Mustafa Muhammad; +Cc: ceph-devel

One other trick that has been used here: if you look inside the PG 
directories on the OSDs and find that they are mostly empty then it's 
possible some of the memory and peering overhead is related to 
empty and useless PG instances on the wrong OSDs.  You can write a script 
to find empty directories (or ones that only contain the single pgmeta 
object with a mostly-empty name) and remove them (using 
ceph-objectstore-tool).  (For safety I'd recommend doing 
ceph-objectstore-tool export first, just in case there is some useful 
metadata there.)

That will only help if most of the pg dirs look empty, though.  If so, 
it's worth a shot!
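
To make that concrete, here is a rough, untested sketch of such a scan in
Python. It only prints the ceph-objectstore-tool commands instead of
running them; the FileStore paths, the <pgid>_head directory naming, and
the "__head" prefix used to spot a lone pgmeta object are assumptions that
should be verified on a real OSD first, and the OSD has to be stopped
before ceph-objectstore-tool is run:

#!/usr/bin/env python
# Sketch: list PG directories under a (stopped) FileStore OSD that are
# empty or contain only the single pgmeta object, and print the export +
# remove commands one could then run by hand after reviewing them.
import os
import sys

osd_path = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/ceph/osd/ceph-0"
current = os.path.join(osd_path, "current")

for entry in sorted(os.listdir(current)):
    # PG collections are named like "1.2f3_head"; skip meta, omap, etc.
    if not entry.endswith("_head"):
        continue
    pgid = entry[:-len("_head")]
    pgdir = os.path.join(current, entry)
    files = []
    for root, dirs, names in os.walk(pgdir):
        files.extend(names)
    # "empty" means no objects at all, or only the pgmeta object, whose
    # on-disk name is mostly empty (assumed here to start with "__head").
    if not files or (len(files) == 1 and files[0].startswith("__head")):
        print("# pg %s looks empty (%d file(s) on disk)" % (pgid, len(files)))
        print("ceph-objectstore-tool --data-path %s --journal-path %s/journal"
              " --pgid %s --op export --file /root/pg-%s.export"
              % (osd_path, osd_path, pgid, pgid))
        print("ceph-objectstore-tool --data-path %s --journal-path %s/journal"
              " --pgid %s --op remove" % (osd_path, osd_path, pgid))

Exporting before removing keeps a copy around in case a PG turns out not to
be as useless as it looked.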

The other thing we once did was use a kludge patch to trim the 
past_intervals metadata, which was responsible for most of the memory 
usage.  I can't tell from the profile in this thread if that is the case 
or not.  There is a patch floating around in git somewhere that can be 
reused if it looks like that is the thing consuming the memory.

sage


On Sat, 19 Aug 2017, Mustafa Muhammad wrote:

> Hi all,
> Looks like the memory is consumed in the
> "PG::RecoveryState::RecoveryMachine::send_notify", is this related to
> messenger? Can we get lower memory usage even if this means slower
> peering (or delayed recovery)?
> 
> Thanks in advance
> 
> Mustafa Muhammad
> 
> 
> On Thu, Aug 17, 2017 at 9:51 PM, Linux Chips <linux.chips@gmail.com> wrote:
> >
> >
> > On 08/17/2017 08:53 PM, Gregory Farnum wrote:
> >>
> >> On Thu, Aug 17, 2017 at 7:13 AM, Linux Chips <linux.chips@gmail.com>
> >> wrote:
> >>>
> >>> Hello everybody,
> >>> I have Kraken cluster with 660 OSD, currently it is down due to not
> >>> being able to complete peering, OSDs start consuming lots of memory
> >>> draining the system and killing the node, so I set a limit on the OSD
> >>> service (on some OSDs 28G and others as high as 35G), so they get
> >>> killed before taking down the whole node.
> >>> Now I still can't peer, one OSD entering the cluster (with about 300
> >>> already up) makes memory usage of most other OSDs so high (15G+, some as
> >>> much as 30G) and
> >>> sometimes kills them when they reach the service limit. which cause a
> >>> spiral
> >>> load and causing all the OSDs to consume all the available.
> >>>
> >>> I found this thread with similar symptoms:
> >>>
> >>>
> >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017522.html
> >>>
> >>> with a request for stack trace, I have a 14G core dump, we generated it
> >>> by
> >>> running the osd from the terminal, enabling the core dumps, and setting
> >>> ulimits to 15G. what kind of a trace would be useful? all thread?! any
> >>> better way to debug this?
> >>>
> >>> What can I do do make it work, is this memory allocation normal?
> >>>
> >>> some info about the cluster:
> >>> 41 hdd nodes with 12 x 4TB osd each, 5 of the nodes have 8TB disks. 324
> >>> GB
> >>> RAM and dula socket intel xeon.
> >>> 7 nodes with 400GB x 24 ssd and 256GB RAM, and dual socket cpu.
> >>> 3 monitors
> >>>
> >>> all dual 10GB ethernet, except for the monitor with dual 1GB ethers.
> >>>
> >>> all nodes running centos 7.2
> >>> it is an old cluster that was upgraded continuously for the past 3 years.
> >>> the cluster was on jewel when the issue happened due to some accidental
> >>> OSD
> >>> map changes, causing a heavy recovery operations on the cluster. then we
> >>> upgraded to kraken in the hope of less memory foot prints.
> >>>
> >>> any advice on how to proceed?
> >>
> >> It's not normal but if something really bad happened to your cluster,
> >> it's been known to occur. You should go through the troubleshooting
> >> guides at docs.ceph.com, but the general strategy is to set
> >> nodown/noout/etc flags, undo whatever horrible thing you tried to make
> >> the map do, and then turn all the OSDs back on.
> >> -Greg
> >
> >
> > Hi,
> > we have been trying this for the past week, it keeps consuming the RAM.
> > we got the map back to the original places. marked all the flags, started
> > all the OSDs. then "ceph osd unset noup", wait 5 min, and all OSDs are
> > killed by the oom.
> > we tried one node at a time, let it finish recovering, and start the next.
> > we got to a point where, when we started the next node, everything got killed.
> > we tried one OSD at a time, same result. one OSD up, ~40 killed by oom, then
> > it is a snowball from here until all of the active OSDs get killed.
> >
> > I think all this up/down that we generated has increased the recovery too
> > much. btw, we stopped all clients. and also we have some not so friendly
> > erasure pools. some OSDs now report loading as much as 800 pg, while we
> > originally had about 300-400 (I know too much, but we were trying to fix it
> > and.... well we could not).
> >
> > we did a memory profiling on one of the OSDs.
> > here is the results
> >
> >
> >  12878.6  47.6%  47.6%  12878.6  47.6% std::_Rb_tree::_M_create_node
> >  12867.6  47.6%  95.2%  25746.2  95.2% std::_Rb_tree::_M_copy
> >    532.4   2.0%  97.2%    686.3   2.5% OSD::heartbeat
> >    122.8   0.5%  97.7%    122.8   0.5% std::_Rb_tree::_M_emplace_hint_unique
> >    121.9   0.5%  98.1%    171.1   0.6% AsyncConnection::send_message
> >    104.2   0.4%  98.5%    104.2   0.4% ceph::buffer::list::append@c4a770
> >     99.7   0.4%  98.9%     99.7   0.4% std::vector::_M_default_append
> >     99.6   0.4%  99.2%     99.6   0.4% ceph::logging::Log::create_entry
> >     72.6   0.3%  99.5%     72.6   0.3% ceph::buffer::create_aligned
> >     52.4   0.2%  99.7%     52.5   0.2% std::vector::_M_emplace_back_aux
> >     23.9   0.1%  99.8%     57.8   0.2% OSD::do_notifies
> >     17.0   0.1%  99.8%     23.1   0.1% OSDService::build_incremental_map_msg
> >      9.8   0.0%  99.9%    222.5   0.8% std::enable_if::type decode
> >      6.2   0.0%  99.9%      6.3   0.0% std::map::operator[]
> >      5.5   0.0%  99.9%      5.5   0.0% std::vector::vector
> >      3.5   0.0%  99.9%      3.5   0.0% EventCenter::create_time_event
> >      2.5   0.0%  99.9%      2.5   0.0% AsyncConnection::AsyncConnection
> >      2.4   0.0% 100.0%      2.4   0.0% std::string::_Rep::_S_create
> >      1.5   0.0% 100.0%      1.5   0.0% std::_Rb_tree::_M_insert_unique
> >      1.4   0.0% 100.0%      1.4   0.0% std::list::operator=
> >      1.3   0.0% 100.0%      1.3   0.0% ceph::buffer::list::list
> >      0.9   0.0% 100.0%    204.1   0.8% decode_message
> >      0.8   0.0% 100.0%      0.8   0.0% OSD::send_failures
> >      0.7   0.0% 100.0%      0.9   0.0% void decode
> >      0.6   0.0% 100.0%      0.6   0.0% std::_Rb_tree::_M_insert_equal
> >      0.6   0.0% 100.0%      2.5   0.0% PG::queue_null
> >      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::create_connect
> >      0.6   0.0% 100.0%      1.8   0.0% AsyncMessenger::add_accept
> >      0.5   0.0% 100.0%      0.5   0.0% boost::statechart::event::clone
> >      0.4   0.0% 100.0%      0.4   0.0% PG::queue_peering_event
> >      0.3   0.0% 100.0%      0.3   0.0% OSD::PeeringWQ::_enqueue
> >      0.3   0.0% 100.0%    148.6   0.5% OSD::_dispatch
> >      0.1   0.0% 100.0%    147.9   0.5% OSD::handle_osd_map
> >      0.1   0.0% 100.0%      0.1   0.0% std::deque::_M_push_back_aux
> >      0.1   0.0% 100.0%      0.2   0.0% SharedLRU::add
> >      0.1   0.0% 100.0%      0.1   0.0% OSD::PeeringWQ::_dequeue
> >      0.1   0.0% 100.0%      0.1   0.0% ceph::buffer::list::append@c4a9b0
> >      0.1   0.0% 100.0%      0.2   0.0% DispatchQueue::enqueue
> >      0.1   0.0% 100.0%    283.5   1.0% EventCenter::process_events
> >      0.1   0.0% 100.0%      0.1   0.0% HitSet::Params::create_impl
> >      0.1   0.0% 100.0%      0.1   0.0% SimpleLRU::clear_pinned
> >      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_
> >      0.0   0.0% 100.0%      0.2   0.0% TrackedOp::mark_event
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::create_context
> >      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_allocate_node
> >      0.0   0.0% 100.0%      0.0   0.0% OSDMap::OSDMap
> >      0.0   0.0% 100.0%    281.6   1.0% AsyncConnection::process
> >      0.0   0.0% 100.0%  25802.4  95.4%
> > PG::RecoveryState::RecoveryMachine::send_notify
> >      0.0   0.0% 100.0%      0.0   0.0% SharedLRU::lru_add
> >      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_insert_unique_
> >      0.0   0.0% 100.0%      0.1   0.0% OpTracker::unregister_inflight_op
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_verify_authorizer
> >      0.0   0.0% 100.0%      0.0   0.0% OSDService::_add_map
> >      0.0   0.0% 100.0%      0.1   0.0% OSD::wait_for_new_map
> >      0.0   0.0% 100.0%      0.5   0.0% OSD::handle_pg_notify
> >      0.0   0.0% 100.0%      0.0   0.0% std::__shared_count::__shared_count
> >      0.0   0.0% 100.0%      0.0   0.0% std::__shared_ptr::reset
> >      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b84080
> >      0.0   0.0% 100.0%      0.0   0.0% std::_Rb_tree::_M_emplace_unique
> >      0.0   0.0% 100.0%      0.0   0.0% std::vector::operator=
> >      0.0   0.0% 100.0%      0.0   0.0% MonClient::_renew_subs
> >      0.0   0.0% 100.0%      0.0   0.0% std::_Hashtable::_M_emplace
> >      0.0   0.0% 100.0%      0.0   0.0% PORT_Alloc_Util
> >      0.0   0.0% 100.0%      0.0   0.0% CryptoAES::get_key_handler
> >      0.0   0.0% 100.0%      0.0   0.0% get_auth_session_handler
> >      0.0   0.0% 100.0%      0.0   0.0% PosixWorker::connect
> >      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a440
> >      0.0   0.0% 100.0%      0.0   0.0% std::vector::_M_fill_insert
> >      0.0   0.0% 100.0%      4.8   0.0% AsyncConnection::fault
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::send_pg_stats
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::accept_conn
> >      0.0   0.0% 100.0%      0.0   0.0% PosixServerSocketImpl::accept
> >      0.0   0.0% 100.0%      9.3   0.0% AsyncConnection::_process_connection
> >      0.0   0.0% 100.0%      0.2   0.0% FileStore::lfn_open
> >      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::list::append@c4a350
> >      0.0   0.0% 100.0%      0.0   0.0% crush_create
> >      0.0   0.0% 100.0%      0.1   0.0% MgrClient::send_report
> >      0.0   0.0% 100.0%      0.0   0.0% WBThrottle::queue_wb
> >      0.0   0.0% 100.0%      0.2   0.0% LogClient::_get_mon_log_message
> >      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::_set_secret
> >      0.0   0.0% 100.0%      0.0   0.0% std::_Deque_base::_M_initialize_map
> >      0.0   0.0% 100.0%      0.1   0.0%
> > ThreadPool::BatchWorkQueue::_void_dequeue
> >      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@ba6a50
> >      0.0   0.0% 100.0%      0.0   0.0% MonClient::schedule_tick
> >      0.0   0.0% 100.0%      0.1   0.0% OSD::tick
> >      0.0   0.0% 100.0%     37.6   0.1% OSD::tick_without_osd_lock
> >      0.0   0.0% 100.0%      0.0   0.0%
> > boost::spirit::classic::impl::get_definition
> >      0.0   0.0% 100.0%      9.4   0.0% MonClient::_send_mon_message
> >      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_refused
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_command
> >      0.0   0.0% 100.0%      0.0   0.0% DispatchQueue::queue_accept
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_connect
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::_stop
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::accept
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::handle_connect_msg
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::mark_down
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::prepare_send_message
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_bulk
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::read_until
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncConnection::send_keepalive
> >      0.0   0.0% 100.0%      3.3   0.0% AsyncConnection::wakeup_from
> >      0.0   0.0% 100.0%      1.8   0.0% AsyncMessenger::get_connection
> >      0.0   0.0% 100.0%      0.0   0.0% AsyncMessenger::reap_dead
> >      0.0   0.0% 100.0%      2.5   0.0% C_OnMapCommit::finish
> >      0.0   0.0% 100.0%      0.0   0.0%
> > CephXTicketHandler::verify_service_ticket_reply
> >      0.0   0.0% 100.0%      0.0   0.0%
> > CephXTicketManager::verify_service_ticket_reply
> >      0.0   0.0% 100.0%      0.0   0.0%
> > CephxAuthorizeHandler::verify_authorizer
> >      0.0   0.0% 100.0%      0.0   0.0% CephxClientHandler::handle_response
> >      0.0   0.0% 100.0%     40.9   0.2% Context::complete
> >      0.0   0.0% 100.0%      4.8   0.0% CrushWrapper::encode
> >      0.0   0.0% 100.0%      0.0   0.0% CryptoAESKeyHandler::decrypt
> >      0.0   0.0% 100.0%      0.0   0.0% CryptoKey::decode
> >      0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::DispatchThread::entry
> >      0.0   0.0% 100.0%    160.4   0.6% DispatchQueue::entry
> >      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::fast_dispatch
> >      0.0   0.0% 100.0%      0.4   0.0% DispatchQueue::pre_dispatch
> >      0.0   0.0% 100.0%      0.0   0.0% EntityName::set
> >      0.0   0.0% 100.0%      0.0   0.0% EpollDriver::event_wait
> >      0.0   0.0% 100.0%      3.0   0.0% EventCenter::dispatch_event_external
> >      0.0   0.0% 100.0%      3.3   0.0% EventCenter::process_time_events
> >      0.0   0.0% 100.0%      3.0   0.0% EventCenter::wakeup
> >      0.0   0.0% 100.0%      0.0   0.0% FileJournal::prepare_entry
> >      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_op
> >      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transaction
> >      0.0   0.0% 100.0%      0.2   0.0% FileStore::_do_transactions
> >      0.0   0.0% 100.0%      0.0   0.0% FileStore::_journaled_ahead
> >      0.0   0.0% 100.0%      0.2   0.0% FileStore::_write
> >      0.0   0.0% 100.0%      0.0   0.0% FileStore::queue_transactions
> >      0.0   0.0% 100.0%      2.6   0.0% Finisher::finisher_thread_entry
> >      0.0   0.0% 100.0%      0.1   0.0% FunctionContext::finish
> >      0.0   0.0% 100.0%      0.1   0.0% HitSet::Params::decode
> >      0.0   0.0% 100.0%      0.2   0.0% LogChannel::do_log@a90a00
> >      0.0   0.0% 100.0%      0.3   0.0% LogChannel::do_log@a91030
> >      0.0   0.0% 100.0%      0.2   0.0% LogClient::get_mon_log_message
> >      0.0   0.0% 100.0%      0.0   0.0% LogClient::handle_log_ack
> >      0.0   0.0% 100.0%      0.1   0.0% LogClient::queue
> >      0.0   0.0% 100.0%      0.3   0.0% LogClientTemp::~LogClientTemp
> >      0.0   0.0% 100.0%      0.0   0.0% MAuthReply::decode_payload
> >      0.0   0.0% 100.0%      0.0   0.0% MCommand::decode_payload
> >      0.0   0.0% 100.0%      0.0   0.0% MCommand::print
> >      0.0   0.0% 100.0%      0.0   0.0% MMgrMap::decode_payload
> >      0.0   0.0% 100.0%      0.0   0.0% MOSDFailure::print
> >      0.0   0.0% 100.0%      0.1   0.0% MOSDMap::decode_payload
> >      0.0   0.0% 100.0%    203.1   0.8% MOSDPGNotify::decode_payload
> >      0.0   0.0% 100.0%      0.0   0.0% MOSDPGNotify::print
> >      0.0   0.0% 100.0%      0.0   0.0% MOSDPing::encode_payload
> >      0.0   0.0% 100.0%      0.0   0.0% Message::encode
> >      0.0   0.0% 100.0%      0.0   0.0% MgrClient::handle_mgr_map
> >      0.0   0.0% 100.0%      0.0   0.0% MgrClient::ms_dispatch
> >      0.0   0.0% 100.0%      0.0   0.0% MgrMap::decode
> >      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_rotating
> >      0.0   0.0% 100.0%      0.0   0.0% MonClient::_check_auth_tickets
> >      0.0   0.0% 100.0%      0.0   0.0% MonClient::_finish_hunting
> >      0.0   0.0% 100.0%      0.8   0.0% MonClient::_reopen_session@aeab80
> >      0.0   0.0% 100.0%      0.6   0.0% MonClient::_reopen_session@af2ba0
> >      0.0   0.0% 100.0%      9.5   0.0% MonClient::handle_auth
> >      0.0   0.0% 100.0%      9.6   0.0% MonClient::ms_dispatch
> >      0.0   0.0% 100.0%      0.2   0.0% MonClient::send_log
> >      0.0   0.0% 100.0%      0.6   0.0% MonClient::tick
> >      0.0   0.0% 100.0%    283.5   1.0% NetworkStack::get_worker
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::CommandWQ::_process
> >      0.0   0.0% 100.0%  25862.5  95.7% OSD::PeeringWQ::_process
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::Session::Session
> >      0.0   0.0% 100.0%    686.3   2.5% OSD::T_Heartbeat::entry
> >      0.0   0.0% 100.0%      2.5   0.0% OSD::_committed_osd_maps
> >      0.0   0.0% 100.0%  25804.6  95.5% OSD::advance_pg
> >      0.0   0.0% 100.0%      0.3   0.0% OSD::check_ops_in_flight
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::check_osdmap_features
> >      0.0   0.0% 100.0%      2.5   0.0% OSD::consume_map
> >      0.0   0.0% 100.0%     57.8   0.2% OSD::dispatch_context
> >      0.0   0.0% 100.0%      0.5   0.0% OSD::dispatch_op
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::do_command
> >      0.0   0.0% 100.0%      0.2   0.0% OSD::do_waiters
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::get_osdmap_pobject_name
> >      0.0   0.0% 100.0%      0.1   0.0% OSD::handle_osd_ping
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::handle_pg_peering_evt
> >      0.0   0.0% 100.0%     37.2   0.1% OSD::heartbeat_check
> >      0.0   0.0% 100.0%      0.1   0.0% OSD::heartbeat_dispatch
> >      0.0   0.0% 100.0%    686.3   2.5% OSD::heartbeat_entry
> >      0.0   0.0% 100.0%      1.1   0.0% OSD::heartbeat_reset
> >      0.0   0.0% 100.0%    148.7   0.6% OSD::ms_dispatch
> >      0.0   0.0% 100.0%      0.8   0.0% OSD::ms_handle_connect
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_refused
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::ms_handle_reset
> >      0.0   0.0% 100.0%  25862.5  95.7% OSD::process_peering_events
> >      0.0   0.0% 100.0%      0.1   0.0% OSD::require_same_or_newer_map
> >      0.0   0.0% 100.0%      0.0   0.0% OSD::write_superblock
> >      0.0   0.0% 100.0%      0.0   0.0% OSDCap::parse
> >      0.0   0.0% 100.0%      0.0   0.0% OSDMap::Incremental::decode
> >      0.0   0.0% 100.0%     35.1   0.1% OSDMap::decode@b85440
> >      0.0   0.0% 100.0%    110.8   0.4% OSDMap::encode
> >      0.0   0.0% 100.0%      0.5   0.0% OSDMap::post_decode
> >      0.0   0.0% 100.0%      0.1   0.0% OSDService::_get_map_bl
> >      0.0   0.0% 100.0%      0.0   0.0% OSDService::check_nearfull_warning
> >      0.0   0.0% 100.0%      0.1   0.0% OSDService::clear_map_bl_cache_pins
> >      0.0   0.0% 100.0%      1.1   0.0% OSDService::get_con_osd_hb
> >      0.0   0.0% 100.0%      1.3   0.0% OSDService::get_inc_map_bl
> >      0.0   0.0% 100.0%      1.3   0.0% OSDService::pin_map_bl
> >      0.0   0.0% 100.0%      0.0   0.0% OSDService::pin_map_inc_bl
> >      0.0   0.0% 100.0%      0.0   0.0% OSDService::publish_superblock
> >      0.0   0.0% 100.0%      0.3   0.0% OSDService::queue_for_peering
> >      0.0   0.0% 100.0%     27.2   0.1% OSDService::send_incremental_map
> >      0.0   0.0% 100.0%     27.2   0.1% OSDService::share_map_peer
> >      0.0   0.0% 100.0%      0.0   0.0% OSDService::update_osd_stat
> >      0.0   0.0% 100.0%      0.0   0.0%
> > ObjectStore::Transaction::_get_coll_id
> >      0.0   0.0% 100.0%      0.0   0.0%
> > ObjectStore::Transaction::_get_next_op
> >      0.0   0.0% 100.0%      0.2   0.0% ObjectStore::Transaction::write
> >      0.0   0.0% 100.0%      0.0   0.0% ObjectStore::queue_transaction
> >      0.0   0.0% 100.0%      0.0   0.0% Objecter::_maybe_request_map
> >      0.0   0.0% 100.0%      0.1   0.0% Objecter::handle_osd_map
> >      0.0   0.0% 100.0%      0.1   0.0% OpHistory::insert
> >      0.0   0.0% 100.0%      0.0   0.0% OpRequest::OpRequest
> >      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_flag_point
> >      0.0   0.0% 100.0%      0.1   0.0% OpRequest::mark_started
> >      0.0   0.0% 100.0%      0.1   0.0% OpTracker::RemoveOnDelete::operator
> >      0.0   0.0% 100.0%      0.1   0.0% OpTracker::_mark_event
> >      0.0   0.0% 100.0%      0.0   0.0% OpTracker::get_age_ms_histogram
> >      0.0   0.0% 100.0%  25802.4  95.4% PG::RecoveryState::Stray::react
> >      0.0   0.0% 100.0%      0.0   0.0% PG::_prepare_write_info
> >      0.0   0.0% 100.0%  25802.4  95.4% PG::handle_activate_map
> >      0.0   0.0% 100.0%      1.6   0.0% PG::handle_advance_map
> >      0.0   0.0% 100.0%      0.0   0.0% PG::prepare_write_info
> >      0.0   0.0% 100.0%      0.0   0.0% PG::write_if_dirty
> >      0.0   0.0% 100.0%      1.6   0.0% PGPool::update
> >      0.0   0.0% 100.0%      0.0   0.0% PK11_FreeSymKey
> >      0.0   0.0% 100.0%      0.0   0.0% PK11_GetIVLength
> >      0.0   0.0% 100.0%      0.0   0.0% PK11_ImportSymKey
> >      0.0   0.0% 100.0%      0.0   0.0% PrebufferedStreambuf::overflow
> >      0.0   0.0% 100.0%      1.8   0.0% Processor::accept
> >      0.0   0.0% 100.0%      0.0   0.0% SECITEM_CopyItem_Util
> >      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_after
> >      0.0   0.0% 100.0%      0.0   0.0% SafeTimer::add_event_at
> >      0.0   0.0% 100.0%     38.4   0.1% SafeTimer::timer_thread
> >      0.0   0.0% 100.0%     38.4   0.1% SafeTimerThread::entry
> >      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::WorkThread::entry
> >      0.0   0.0% 100.0%  25862.7  95.7% ThreadPool::worker
> >      0.0   0.0% 100.0%  27023.8 100.0% __clone
> >      0.0   0.0% 100.0%      0.0   0.0%
> > boost::detail::function::void_function_obj_invoker2::invoke
> >      0.0   0.0% 100.0%      0.0   0.0%
> > boost::proto::detail::default_assign::impl::operator
> >      0.0   0.0% 100.0%      0.0   0.0%
> > boost::spirit::classic::impl::concrete_parser::do_parse_virtual
> >      0.0   0.0% 100.0%      0.0   0.0% boost::spirit::qi::action::parse
> >      0.0   0.0% 100.0%      0.3   0.0%
> > boost::statechart::event_base::intrusive_from_this
> >      0.0   0.0% 100.0%  25802.4  95.4%
> > boost::statechart::simple_state::react_impl
> >      0.0   0.0% 100.0%  25802.4  95.4%
> > boost::statechart::state_machine::send_event
> >      0.0   0.0% 100.0%      0.0   0.0% ceph::Formatter::create@48b620
> >      0.0   0.0% 100.0%      0.4   0.0%
> > ceph::buffer::list::contiguous_appender::contiguous_appender
> >      0.0   0.0% 100.0%      2.4   0.0% ceph::buffer::list::crc32c
> >      0.0   0.0% 100.0%      0.1   0.0%
> > ceph::buffer::list::iterator_impl::copy
> >      0.0   0.0% 100.0%      0.0   0.0%
> > ceph::buffer::list::iterator_impl::copy_deep
> >      0.0   0.0% 100.0%      5.7   0.0%
> > ceph::buffer::list::iterator_impl::copy_shallow
> >      0.0   0.0% 100.0%      0.0   0.0% ceph::buffer::ptr::ptr
> >      0.0   0.0% 100.0%      0.0   0.0% ceph_heap_profiler_handle_command
> >      0.0   0.0% 100.0%      0.0   0.0% ceph_os_fremovexattr
> >      0.0   0.0% 100.0%      0.0   0.0% cephx_verify_authorizer
> >      0.0   0.0% 100.0%      0.0   0.0% cmdmap_from_json
> >      0.0   0.0% 100.0%      2.2   0.0% crush_hash_name
> >      0.0   0.0% 100.0%      0.1   0.0% decode
> >      0.0   0.0% 100.0%     20.1   0.1% entity_addr_t::encode
> >      0.0   0.0% 100.0%      0.0   0.0% get_str_vec
> >      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15110
> >      0.0   0.0% 100.0%      0.0   0.0% int decode_decrypt@c15b90
> >      0.0   0.0% 100.0%      0.0   0.0%
> > json_spirit::Semantic_actions::new_name
> >      0.0   0.0% 100.0%      0.0   0.0%
> > json_spirit::Semantic_actions::new_str
> >      0.0   0.0% 100.0%      1.1   0.0% json_spirit::Value_impl::get_uint64
> >      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str
> >      0.0   0.0% 100.0%      0.0   0.0% json_spirit::get_str_
> >      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read
> >      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range
> >      0.0   0.0% 100.0%      0.0   0.0% json_spirit::read_range_or_throw
> >      0.0   0.0% 100.0%      0.0   0.0% json_spirit::substitute_esc_chars
> >      0.0   0.0% 100.0%      0.0   0.0% operator<<@a91e90
> >      0.0   0.0% 100.0%      3.5   0.0% osd_info_t::encode
> >      0.0   0.0% 100.0%      4.4   0.0% osd_xinfo_t::encode
> >      0.0   0.0% 100.0%      0.1   0.0% pg_info_t::decode
> >      0.0   0.0% 100.0%      0.0   0.0% pg_info_t::operator=
> >      0.0   0.0% 100.0%      9.9   0.0% pg_info_t::pg_info_t
> >      0.0   0.0% 100.0%     87.5   0.3% pg_interval_t::decode
> >      0.0   0.0% 100.0%      1.0   0.0% pg_pool_t::decode
> >      0.0   0.0% 100.0%      1.8   0.0% pg_pool_t::encode
> >      0.0   0.0% 100.0%      0.0   0.0% pg_stat_t::decode
> >      0.0   0.0% 100.0%  27032.1 100.0% start_thread
> >      0.0   0.0% 100.0%      1.3   0.0% std::_Rb_tree::operator=
> >      0.0   0.0% 100.0%      0.1   0.0% std::_Sp_counted_base::_M_release
> >      0.0   0.0% 100.0%      0.0   0.0% std::__detail::_Map_base::operator[]
> >      0.0   0.0% 100.0%      0.0   0.0% std::__ostream_insert
> >      0.0   0.0% 100.0%      0.1   0.0% std::basic_streambuf::xsputn
> >      0.0   0.0% 100.0%      0.1   0.0% std::basic_string::basic_string
> >      0.0   0.0% 100.0%      0.0   0.0% std::basic_stringbuf::overflow
> >      0.0   0.0% 100.0%      1.0   0.0% std::basic_stringbuf::str
> >      0.0   0.0% 100.0%     71.0   0.3% std::enable_if::type encode
> >      0.0   0.0% 100.0%      0.1   0.0% std::getline
> >      0.0   0.0% 100.0%      0.0   0.0% std::num_put::_M_insert_int
> >      0.0   0.0% 100.0%      0.0   0.0% std::num_put::do_put
> >      0.0   0.0% 100.0%      0.0   0.0% std::operator<<
> >      0.0   0.0% 100.0%      0.0   0.0% std::ostream::_M_insert
> >      0.0   0.0% 100.0%      1.2   0.0% std::string::_Rep::_M_clone
> >      0.0   0.0% 100.0%      1.2   0.0% std::string::_S_construct
> >      0.0   0.0% 100.0%      1.2   0.0% std::string::append
> >      0.0   0.0% 100.0%      1.2   0.0% std::string::reserve
> >      0.0   0.0% 100.0%    283.5   1.0% std::this_thread::__sleep_for
> >      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c12db0
> >      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c14a80
> >      0.0   0.0% 100.0%      0.0   0.0% void decode_decrypt_enc_bl@c15450
> >      0.0   0.0% 100.0%     20.1   0.1% void encode
> >
> >
> > I also generated the PDF with all the charts, but not sure how to share it
> > with you guys.
> > Any idea what is happening here?
> >
> > thanks
> > ali
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-22 22:33       ` Sage Weil
@ 2017-08-23 12:21         ` Linux Chips
  2017-08-23 13:46           ` Sage Weil
  0 siblings, 1 reply; 27+ messages in thread
From: Linux Chips @ 2017-08-23 12:21 UTC (permalink / raw)
  To: Sage Weil, Mustafa Muhammad; +Cc: ceph-devel

On 08/23/2017 01:33 AM, Sage Weil wrote:
> One other trick that has been used here: if you look inside the PG
> directories on the OSDs and find that they are mostly empty then it's
> possible some of the memory and peering overhead is related to
> empty and useless PG instances on the wrong OSDs.  You can write a script
> to find empty directories (or ones that only contain the single pgmeta
> object with a mostly-empty name) and remove them (using
> ceph-objectstore-tool).  (For safety I'd recommend doing
> ceph-objectstore-tool export first, just in case there is some useful
> metadata there.)
> 
> That will only help if most of the pg dirs look empty, though.  If so,
> it's worth a shot!
> 
> The other thing we once did was use a kludge patch to trim the
> past_intervals metadata, which was responsible for most of the memory
> usage.  I can't tell from the profile in this thread if that is the case
> or not.  There is a patch floating around in git somewhere that can be
> reused if it looks like that is the thing consuming the memory.
> 
> sage
> 
> 

We'll try the empty PG search. Not sure how many there are, but I 
randomly checked and found a few.

As for the "kludge" patch, where can I find it? I searched the git repo 
but could not identify it; I did not know what to look for specifically.
Also, what would we need in order to tell whether the patch would help, 
e.g. do we need another/more memory profiling run?

We installed a 4-node test cluster and replicated the issue there, and 
we are testing various scenarios on it. If anyone cares to reproduce it, 
I can elaborate on the steps.

If all else fails, do you think moving PGs out of the current dir is 
safe? We are trying to test it, but we'll never be 100% sure.

thanks
ali



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-23 12:21         ` Linux Chips
@ 2017-08-23 13:46           ` Sage Weil
  2017-08-23 15:27             ` Linux Chips
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2017-08-23 13:46 UTC (permalink / raw)
  To: Linux Chips; +Cc: Mustafa Muhammad, ceph-devel

On Wed, 23 Aug 2017, Linux Chips wrote:
> On 08/23/2017 01:33 AM, Sage Weil wrote:
> > One other trick that has been used here: if you look inside the PG
> > directories on the OSDs and find that they are mostly empty then it's
> > possible some of the memory and peering overhead is related to
> > empty and useless PG instances on the wrong OSDs.  You can write a script
> > to find empty directories (or ones that only contain the single pgmeta
> > object with a mostly-empty name) and remove them (using
> > ceph-objectstore-tool).  (For safety I'd recommend doing
> > ceph-objectstore-tool export first, just in case there is some useful
> > metadata there.)
> > 
> > That will only help if most of the pg dirs look empty, though.  If so,
> > it's worth a shot!
> > 
> > The other thing we once did was use a kludge patch to trim the
> > past_intervals metadata, which was responsible for most of the memory
> > usage.  I can't tell from the profile in this thread if that is the case
> > or not.  There is a patch floating around in git somewhere that can be
> > reused if it looks like that is the thing consuming the memory.
> > 
> > sage
> > 
> > 
> 
> we ll try the empty pg search. not sure how much is there, but i randomly
> checked and found a few.
> 
> as for the "kludge" patch, where can I find it. I searched the git repo, but
> could not identify it. did not know what to look for specifically.
> also, what would we need to better know if the patch would be useful?
> e.g. if we need another/more mem profiling.

I found and rebased the branch, but until we have some confidence this is 
the problem I wouldn't use it.
 
> we installed a test cluster of 4 nodes and replicated the issue there, and we
> are testing various scenarios there. if any one cares to replicate it i can
> elaborate on the steps.

How were you able to reproduce the situation?

> if all failed, do you think moving pgs out of the current dir is safe? we are
> trying to test it, but we ll never be sure 100%

It is safe if you use ceph-objectstore-tool export and then remove.  Do 
not just move the directory around as that will leave behind all kinds of 
random state in leveldb!

sage

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-23 13:46           ` Sage Weil
@ 2017-08-23 15:27             ` Linux Chips
  2017-08-24  3:58               ` Sage Weil
  0 siblings, 1 reply; 27+ messages in thread
From: Linux Chips @ 2017-08-23 15:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mustafa Muhammad, ceph-devel

On 08/23/2017 04:46 PM, Sage Weil wrote:
> On Wed, 23 Aug 2017, Linux Chips wrote:
>> On 08/23/2017 01:33 AM, Sage Weil wrote:
>>> One other trick that has been used here: if you look inside the PG
>>> directories on the OSDs and find that they are mostly empty then it's
>>> possible some of the memory and peering overhead is related to
>>> empty and useless PG instances on the wrong OSDs.  You can write a script
>>> to find empty directories (or ones that only contain the single pgmeta
>>> object with a mostly-empty name) and remove them (using
>>> ceph-objectstore-tool).  (For safety I'd recommend doing
>>> ceph-objectstore-tool export first, just in case there is some useful
>>> metadata there.)
>>>
>>> That will only help if most of the pg dirs look empty, though.  If so,
>>> it's worth a shot!
>>>
>>> The other thing we once did was use a kludge patch to trim the
>>> past_intervals metadata, which was responsible for most of the memory
>>> usage.  I can't tell from the profile in this thread if that is the case
>>> or not.  There is a patch floating around in git somewhere that can be
>>> reused if it looks like that is the thing consuming the memory.
>>>
>>> sage
>>>
>>>
>>
>> we ll try the empty pg search. not sure how much is there, but i randomly
>> checked and found a few.
>>
>> as for the "kludge" patch, where can I find it. I searched the git repo, but
>> could not identify it. did not know what to look for specifically.
>> also, what would we need to better know if the patch would be useful?
>> e.g. if we need another/more mem profiling.
> 
> I found and rebased the branch, but until we have some confidence this is
> the problem I wouldn't use it.
>   
>> we installed a test cluster of 4 nodes and replicated the issue there, and we
>> are testing various scenarios there. if any one cares to replicate it i can
>> elaborate on the steps.
> 
> How were you able to reproduce the situation?

We deployed the cluster normally.
Create some erasure-code profiles:

ceph osd erasure-code-profile set k3m1 k=3 m=1
ceph osd erasure-code-profile set k9m3 k=9 m=3
ceph osd erasure-code-profile set k6m2 k=6 m=2
ceph osd erasure-code-profile set k12m4 k=12 m=4

Then edit the rule set for these profiles manually so that each rule uses 
(step chooseleaf indep 0 type osd) instead of type host; this is 
necessary because the cluster only has 4 OSD nodes.
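
The rule edit itself is the usual decompile/recompile cycle, roughly (a 
sketch; the file names here are just examples):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# in crush.txt, change "step chooseleaf indep 0 type host" to
# "step chooseleaf indep 0 type osd" in each erasure rule
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new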

add some pools:

ceph osd pool create testk9m3 1024 1024 erasure k9m3
ceph osd pool create testk12m4 1024 1024 erasure k12m4
ceph osd pool create testk6m2 1024 1024 erasure k6m2
ceph osd pool create testk3m1 1024 1024 erasure k3m1

Fill them with some data using rados bench; we let it run for about 
2-3 hours, until we had 2000+ kobjects (more is better).

rados bench -p testk3m1 72000 write -t 256 -b 4K --no-cleanup
rados bench -p testk6m2 72000 write -t 256 -b 4K --no-cleanup
rados bench -p rbd 72000 write -t 256 -b 4K --no-cleanup
rados bench -p testk12m4 7200 write -t 256 -b 1M --no-cleanup


Then we start messing with the placement of the hosts: change their racks 
a couple of times, set OSDs down randomly, and restart them randomly (the 
idea is to simulate a cluster that has been unhealthy for a long time, 
with a lot of osdmaps):

# pick one of the 48 test OSDs at random and restart it, forever
while true ; do i=$(( ( RANDOM % 48 ) + 1 )) ; ./restartOSD \
$(ceph osd tree | grep up | tail -n $i | head -1 | awk '{print $1}' ) \
; done

restartOSD is a script:
#!/bin/bash
osd=$1
IP=$(./findOSD $osd)
echo "sshing to $IP";
ssh $IP "systemctl restart ceph-osd@$osd";

findOSD is:
#!/bin/bash
if [[ $1 =~ ^[0-9]+$ ]]; then
ceph osd find $1 | grep ip | sed -e 's/.*\": \"\(.*\):.*/\1/'
#ceph osd tree | grep -e "host\|osd.$1 " | grep osd.$1 -B1
else
echo "Usage $0 OSDNUM"
fi

We also set a memory limit in the systemd unit file, so the OOM killer 
kills them and makes things go faster. We put this:
[Service]
MemoryLimit=2G

inside "/etc/systemd/system/ceph-osd@.service.d/memory.conf".
We start with something like 2GB and increase it whenever we feel the 
limit is too harsh. By the time we reach a 10GB limit, things are pretty 
ugly though (which, oddly, is good).
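
Applying the drop-in is the usual dance, something like (a sketch; the 
OSD id at the end is just an example):

mkdir -p /etc/systemd/system/ceph-osd@.service.d
cat > /etc/systemd/system/ceph-osd@.service.d/memory.conf <<EOF
[Service]
MemoryLimit=2G
EOF
systemctl daemon-reload
systemctl restart ceph-osd@12    # example id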

After a while the mon store will grow bigger and bigger, and the amount 
of RAM consumed will grow too.
The target is for the OSD status, i.e.
ceph daemon osd.xx status
to show a difference between the oldest and newest map of about 
20000-40000 epochs.
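
A quick way to eyeball that on a node is something like this (a rough 
sketch; assumes jq is installed and the default admin socket paths):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    # print "osd.N: <newest_map - oldest_map> epochs" for each local OSD
    ceph --admin-daemon $sock status | \
        jq -r '"osd.\(.whoami): \(.newest_map - .oldest_map) epochs"'
done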

At this point we stop the restart script and the rados bench. If we 
restart all the OSDs, they will consume all the RAM in the node, and 
either the OOM killer will be fast enough to kill them or the whole node 
will die. So we usually set the memory limit in the unit file to about 
20-30 GB at this point so we do not lose the node.

> 
>> if all failed, do you think moving pgs out of the current dir is safe? we are
>> trying to test it, but we ll never be sure 100%
> 
> It is safe if you use ceph-objectstore-tool export and then remove.  Do
> not just move the directory around as that will leave behind all kinds of
> random state in leveldb!
> 
> sage
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-23 15:27             ` Linux Chips
@ 2017-08-24  3:58               ` Sage Weil
  2017-08-25 22:25                 ` Linux Chips
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2017-08-24  3:58 UTC (permalink / raw)
  To: Linux Chips; +Cc: Mustafa Muhammad, ceph-devel

On Wed, 23 Aug 2017, Linux Chips wrote:
> On 08/23/2017 04:46 PM, Sage Weil wrote:
> > On Wed, 23 Aug 2017, Linux Chips wrote:
> > > On 08/23/2017 01:33 AM, Sage Weil wrote:
> > > > One other trick that has been used here: if you look inside the PG
> > > > directories on the OSDs and find that they are mostly empty then it's
> > > > possible some of the memory and peering overhead is related to
> > > > empty and useless PG instances on the wrong OSDs.  You can write a
> > > > script
> > > > to find empty directories (or ones that only contain the single pgmeta
> > > > object with a mostly-empty name) and remove them (using
> > > > ceph-objectstore-tool).  (For safety I'd recommend doing
> > > > ceph-objectstore-tool export first, just in case there is some useful
> > > > metadata there.)
> > > > 
> > > > That will only help if most of the pg dirs look empty, though.  If so,
> > > > it's worth a shot!
> > > > 
> > > > The other thing we once did was use a kludge patch to trim the
> > > > past_intervals metadata, which was responsible for most of the memory
> > > > usage.  I can't tell from the profile in this thread if that is the case
> > > > or not.  There is a patch floating around in git somewhere that can be
> > > > reused if it looks like that is the thing consuming the memory.
> > > > 
> > > > sage
> > > > 
> > > > 
> > > 
> > > we ll try the empty pg search. not sure how much is there, but i randomly
> > > checked and found a few.
> > > 
> > > as for the "kludge" patch, where can I find it. I searched the git repo,
> > > but
> > > could not identify it. did not know what to look for specifically.
> > > also, what would we need to better know if the patch would be useful?
> > > e.g. if we need another/more mem profiling.
> > 
> > I found and rebased the branch, but until we have some confidence this is
> > the problem I wouldn't use it.
> >   
> > > we installed a test cluster of 4 nodes and replicated the issue there, and
> > > we
> > > are testing various scenarios there. if any one cares to replicate it i
> > > can
> > > elaborate on the steps.
> > 
> > How were you able to reproduce the situation?
> 
> we deployed the cluster normally.
> create some profiles:
> 
> ceph osd erasure-code-profile set k3m1 k=3 m=1
> ceph osd erasure-code-profile set k9m3 k=9 m=3
> ceph osd erasure-code-profile set k6m2 k=6 m=2
> ceph osd erasure-code-profile set k12m4 k=12 m=4
> 
> then edit the rule set manually for them so the (step chooseleaf indep 0 type
> osd) instead of host. this is necessary because it is a 4 osd node cluster.
> 
> add some pools:
> 
> ceph osd pool create testk9m3 1024 1024 erasure k9m3
> ceph osd pool create testk12m4 1024 1024 erasure k12m4
> ceph osd pool create testk6m2 1024 1024 erasure k6m2
> ceph osd pool create testk3m1 1024 1024 erasure k3m1
> 
> fill it with some data using rados bench, we let it running for about 2-3
> hours, until we had about 2000+ kobject, more is better.
> 
> rados bench -p testk3m1 72000 write -t 256 -b 4K --no-cleanup
> rados bench -p testk6m2 72000 write -t 256 -b 4K --no-cleanup
> rados bench -p rbd 72000 write -t 256 -b 4K --no-cleanup
> rados bench -p testk12m4 7200 write -t 256 -b 1M --no-cleanup
> 
> 
> then we start messing with the placement of the hosts change there racks a
> couple of times, set OSDs down randomly, and randomly restarting them (the
> idea is to simulate a long unhealthy cluster, with a lot of osdmaps)
> 
> while true ; do i=$(( ( RANDOM % 48 ) )) ; ./restartOSD \
> $(ceph osd tree | grep up | tail -n $i | head -1 | awk '{print $1}' ) \
> ; done
> 
> restartOSD is a script:
> #!/bin/bash
> osd=$1
> IP=$(./findOSD $osd)
> echo "sshing to $IP";
> ssh $IP "systemctl restart ceph-osd@$osd";
> 
> findOSD is:
> #!/bin/bash
> if [[ $1 =~ ^[1-9]* ]]; then
> ceph osd find $1 | grep ip | sed -e 's/.*\": \"\(.*\):.*/\1/'
> #ceph osd tree | grep -e "host\|osd.$1 " | grep osd.$1 -B1
> else
> echo "Usage $0 OSDNUM"
> fi
> 
> we also set a memory limit in systemd unit file, so the oom kills them and
> make things go faster. we put this
> [Service]
> MemoryLimit=2G
> 
> inside "/etc/systemd/system/ceph-osd@.service.d/memory.conf"
> we start with some thing like 2GB and increase it when ever we feel the limit
> is too harsh. by the time we reach 10GB limit, things are pretty ugly though
> (which, oddly, is good).
> 
> after a while the mon store will grow bigger and bigger. and the amount of ram
> consumed will grow too.
> the target is for the status of the OSDs
> ceph daemon osd.xx status
> will give a difference between the oldest and newest map of about 20000-40000
> epoch.
> 
> at this point, we stop the "restart script" and the "rados bench". if we
> restart all the OSDs, they will consume all the RAM in the node. and either
> the oom will be fast enough to kill them, or the whole node will die. so we
> usually put the memory limit in the unit file at about 20-30 GB at this point
> so we do not loose the node.

Okay, so I think the combination of (1) removing empty PGs and (2) pruning 
past_intervals will help.  (1) can be scripted by looking in 
current/$pg_HEAD directories and picking out the ones with 0 or 1 objects 
in them, doing ceph-objectstore-tool export to make a backup (just in 
case), and then removing them (with ceph-objectstore-tool).  Be careful of 
PGs for empty pools since those will be naturally empty (and you want 
to keep them).
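
Something along these lines, for example (a rough sketch only, not a 
tested script -- it assumes the FileStore layout under current/, that the 
OSD is stopped, and that /root/pg-backups is where you want the exports; 
double-check every pgid before removing anything):

osd=299    # example id
mkdir -p /root/pg-backups
cd /var/lib/ceph/osd/ceph-$osd/current
for d in *_head; do
    pgid=${d%_head}
    # count objects, ignoring the single __head_*/pgmeta placeholder
    n=$(find "$d" -type f ! -name '__head_*' | wc -l)
    if [ "$n" -eq 0 ]; then
        ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$osd \
            --pgid $pgid --op export --file /root/pg-backups/$pgid.export
        ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$osd \
            --pgid $pgid --op remove
    fi
done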

For (2), see the wip-prune-past-intervals-jewel branch in ceph-ci.git.. if 
that is applied to the kraken branch it ought to work (although it's 
untested).  Alternatively, you can just upgrade to luminous, as it 
implements a more sophisticated version of the same thing.  You need to 
upgrade mons, mark all osds down, upgrade osds and start at least one of 
them, and then set 'ceph osd require-osd-release luminous' before it'll 
switch to the new past intervals representation.  Definitely test it on 
your test cluster to ensure it reduces the memory usage!
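
In terms of commands the order is roughly (a sketch; how you upgrade 
packages and restart daemons depends on your distro/tooling, and the OSD 
id below is just an example):

# 1. upgrade and restart the mons
# 2. mark all osds down
ceph osd down $(ceph osd ls)
# 3. upgrade the osd packages, then start at least one osd
systemctl restart ceph-osd@299
# 4. only now does the new past_intervals representation take effect
ceph osd require-osd-release luminous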

If that doesn't sort things out we'll need to see a heap profile for an 
OOMing OSD to make sure we know what is using all of the RAM...

sage

 > 
> > 
> > > if all failed, do you think moving pgs out of the current dir is safe? we
> > > are
> > > trying to test it, but we ll never be sure 100%
> > 
> > It is safe if you use ceph-objectstore-tool export and then remove.  Do
> > not just move the directory around as that will leave behind all kinds of
> > random state in leveldb!
> > 
> > sage
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-24  3:58               ` Sage Weil
@ 2017-08-25 22:25                 ` Linux Chips
  2017-08-25 22:46                   ` Sage Weil
  0 siblings, 1 reply; 27+ messages in thread
From: Linux Chips @ 2017-08-25 22:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mustafa Muhammad, ceph-devel

On 08/24/2017 06:58 AM, Sage Weil wrote:
> 
> Okay, so I think the combination of (1) removing empty PGs and (2) pruning
> past_intervals will help.  (1) can be scripted by looking in
> current/$pg_HEAD directories and picking out the ones with 0 or 1 objects
> in them, doing ceph-objecstore-tool export to make a backup (just in
> case), and then removing them (with ceph-objectstore-tool).  Be careful of
> PGs for empty pools since those will be naturally empty (and you want
> to keep them).
> 
> For (2), see the wip-prune-past-intervals-jewel branch in ceph-ci.git.. if
> that is applied to the kraken branch it ought to work (although it's
> untested).  Alternatively, you can just upgrade to luminous, as it
> implements a more sophisticated version of the same thing.  You need to
> upgrade mons, mark all osds down, upgrade osds and start at least one of
> them, and then set 'ceph osd require-osd-release luminous' before it'll
> switch to the new past intervals representation.  Definitely test it on
> your test cluster to ensure it reduces the memory usage!
> 
> If that doesn't sort things out we'll need to see a heap profile for an
> OOMing OSD to make sure we know what is using all of the RAM...
> 
> sage
> 

Well, a big thank you, Sage. We tested the upgrade on the test cluster, 
and it worked in an awesome way, and we did not even need to remove any 
PG with the tool.
I read about the prune thing in the release notes, so when our attempts 
to start the cluster failed, we tried upgrading, but it did not help. It 
turned out that we had missed the 'ceph osd require-osd-release luminous' 
step. I mean, we were looking at the command in the release notes upgrade 
section and said to each other "it does not matter, it would only 
restrict the old osds from joining" and moved on. Damn, we would have 
been up a week ago.
Having said that, I think the release notes should highlight this in the 
future.

Now we have upgraded the production cluster, and it is up and running; 
the memory footprint is down to about a tenth of what it was. The largest 
RAM usage I saw on an OSD was about 6.5GB.
But we faced some issues, particularly OSDs crashing with "FAILED 
assert(interval.last > last)".


logs:

    -34> 2017-08-26 00:38:00.505114 7f14556b4d00  0 osd.299 1085665 
load_pgs opened 455 pgs
    -33> 2017-08-26 00:38:00.505787 7f14556b4d00 10 osd.299 1085665 
19.f1e needs 1050342-1085230
    -32> 2017-08-26 00:38:00.505814 7f14556b4d00  1 osd.299 1085665 
build_past_intervals_parallel over 1050342-1085230
    -31> 2017-08-26 00:38:00.505818 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050342
    -30> 2017-08-26 00:38:00.505824 7f14556b4d00 20 osd.299 0 get_map 
1050342 - loading and decoding 0x7f14b3dfb0c0
    -29> 2017-08-26 00:38:00.506245 7f14556b4d00 10 osd.299 0 add_map_bl 
1050342 780781 bytes
    -28> 2017-08-26 00:38:00.508539 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050342 pg 19.f1e first map, acting 
[80] up [80], same_interval_since = 1050342
    -27> 2017-08-26 00:38:00.508547 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050343
    -26> 2017-08-26 00:38:00.508550 7f14556b4d00 20 osd.299 0 get_map 
1050343 - loading and decoding 0x7f14b3dfad80
    -25> 2017-08-26 00:38:00.508997 7f14556b4d00 10 osd.299 0 add_map_bl 
1050343 781371 bytes
    -24> 2017-08-26 00:38:00.511176 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050344
    -23> 2017-08-26 00:38:00.511196 7f14556b4d00 20 osd.299 0 get_map 
1050344 - loading and decoding 0x7f14b3dfb740
    -22> 2017-08-26 00:38:00.511625 7f14556b4d00 10 osd.299 0 add_map_bl 
1050344 782446 bytes
    -21> 2017-08-26 00:38:00.513813 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050345
    -20> 2017-08-26 00:38:00.513820 7f14556b4d00 20 osd.299 0 get_map 
1050345 - loading and decoding 0x7f14b3dfba80
    -19> 2017-08-26 00:38:00.514260 7f14556b4d00 10 osd.299 0 add_map_bl 
1050345 782071 bytes
    -18> 2017-08-26 00:38:00.516463 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050346
    -17> 2017-08-26 00:38:00.516488 7f14556b4d00 20 osd.299 0 get_map 
1050346 - loading and decoding 0x7f14b79c4000
    -16> 2017-08-26 00:38:00.516927 7f14556b4d00 10 osd.299 0 add_map_bl 
1050346 781955 bytes
    -15> 2017-08-26 00:38:00.519047 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050347
    -14> 2017-08-26 00:38:00.519054 7f14556b4d00 20 osd.299 0 get_map 
1050347 - loading and decoding 0x7f14b79c4340
    -13> 2017-08-26 00:38:00.519500 7f14556b4d00 10 osd.299 0 add_map_bl 
1050347 781930 bytes
    -12> 2017-08-26 00:38:00.521612 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050348
    -11> 2017-08-26 00:38:00.521619 7f14556b4d00 20 osd.299 0 get_map 
1050348 - loading and decoding 0x7f14b79c4680
    -10> 2017-08-26 00:38:00.522074 7f14556b4d00 10 osd.299 0 add_map_bl 
1050348 784883 bytes
     -9> 2017-08-26 00:38:00.524245 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050349
     -8> 2017-08-26 00:38:00.524252 7f14556b4d00 20 osd.299 0 get_map 
1050349 - loading and decoding 0x7f14b79c49c0
     -7> 2017-08-26 00:38:00.524706 7f14556b4d00 10 osd.299 0 add_map_bl 
1050349 785081 bytes
     -6> 2017-08-26 00:38:00.526854 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050350
     -5> 2017-08-26 00:38:00.526861 7f14556b4d00 20 osd.299 0 get_map 
1050350 - loading and decoding 0x7f14b79c4d00
     -4> 2017-08-26 00:38:00.527330 7f14556b4d00 10 osd.299 0 add_map_bl 
1050350 785948 bytes
     -3> 2017-08-26 00:38:00.529505 7f14556b4d00 10 osd.299 1085665 
build_past_intervals_parallel epoch 1050351
     -2> 2017-08-26 00:38:00.529512 7f14556b4d00 20 osd.299 0 get_map 
1050351 - loading and decoding 0x7f14b79c5040
     -1> 2017-08-26 00:38:00.529979 7f14556b4d00 10 osd.299 0 add_map_bl 
1050351 788650 bytes
      0> 2017-08-26 00:38:00.534373 7f14556b4d00 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: 
In function 'virtual void pi_compact_rep::add_interval(bool, const 
PastIntervals::pg_interval_t&)' thread 7f14556b4d00 time 2017-08-26 
00:38:00.532119
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc: 
3205: FAILED assert(interval.last > last)

  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) 
luminous (rc)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x7f145612f420]
  2: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t 
const&)+0x3b2) [0x7f1455e030b2]
  3: (PastIntervals::check_new_interval(int, int, std::vector<int, 
std::allocator<int> > const&, std::vector<int, std::allocator<int> > 
const&, int, int, std::vector<int, std::allocator<int> > const&, 
std::vector<int, std::allocator<int> > const&, unsigned int, unsigned 
int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, 
IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) 
[0x7f1455de8ab0]
  4: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
  5: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
  6: (OSD::init()+0x2179) [0x7f1455bd7779]
  7: (main()+0x2def) [0x7f1455add56f]
  8: (__libc_start_main()+0xf5) [0x7f1451d14b35]
  9: (()+0x4ac8a6) [0x7f1455b7b8a6]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 1 lockdep
    0/ 1 context
    1/ 1 crush
    1/ 5 mds
    1/ 5 mds_balancer
    1/ 5 mds_locker
    1/ 5 mds_log
    1/ 5 mds_log_expire
    1/ 5 mds_migrator
    0/ 1 buffer
    0/ 1 timer
    0/ 1 filer
    0/ 1 striper
    0/ 1 objecter
    0/ 5 rados
    0/ 5 rbd
    0/ 5 rbd_mirror
    0/ 5 rbd_replay
    0/ 5 journaler
    0/ 5 objectcacher
    0/ 5 client
   20/20 osd
    0/ 5 optracker
    0/ 5 objclass
    1/ 3 filestore
    1/ 3 journal
    0/ 5 ms
    1/ 5 mon
    0/10 monc
    1/ 5 paxos
    0/ 5 tp
    1/ 5 auth
    1/ 5 crypto
    1/ 1 finisher
    1/ 5 heartbeatmap
    1/ 5 perfcounter
    1/ 5 rgw
    1/10 civetweb
    1/ 5 javaclient
    1/ 5 asok
    1/ 1 throttle
    0/ 0 refs
    1/ 5 xio
    1/ 5 compressor
    1/ 5 bluestore
    1/ 5 bluefs
    1/ 3 bdev
    1/ 5 kstore
    4/ 5 rocksdb
    4/ 5 leveldb
    4/ 5 memdb
    1/ 5 kinetic
    1/ 5 fuse
    1/ 5 mgr
    1/ 5 mgrc
    1/ 5 dpdk
    1/ 5 eventtrace
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent     10000
   max_new         1000
   log_file /var/log/ceph/ceph-osd.299.log
--- end dump of recent events ---
2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
  in thread 7f14556b4d00 thread_name:ceph-osd

  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) 
luminous (rc)
  1: (()+0xa21a01) [0x7f14560f0a01]
  2: (()+0xf370) [0x7f1452cfe370]
  3: (gsignal()+0x37) [0x7f1451d281d7]
  4: (abort()+0x148) [0x7f1451d298c8]
  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x284) [0x7f145612f594]
  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t 
const&)+0x3b2) [0x7f1455e030b2]
  7: (PastIntervals::check_new_interval(int, int, std::vector<int, 
std::allocator<int> > const&, std::vector<int, std::allocator<int> > 
const&, int, int, std::vector<int, std::allocator<int> > const&, 
std::vector<int, std::allocator<int> > const&, unsigned int, unsigned 
int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, 
IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) 
[0x7f1455de8ab0]
  8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
  9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
  10: (OSD::init()+0x2179) [0x7f1455bd7779]
  11: (main()+0x2def) [0x7f1455add56f]
  12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
  13: (()+0x4ac8a6) [0x7f1455b7b8a6]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- begin dump of recent events ---
      0> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal 
(Aborted) **
  in thread 7f14556b4d00 thread_name:ceph-osd

  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) 
luminous (rc)
  1: (()+0xa21a01) [0x7f14560f0a01]
  2: (()+0xf370) [0x7f1452cfe370]
  3: (gsignal()+0x37) [0x7f1451d281d7]
  4: (abort()+0x148) [0x7f1451d298c8]
  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x284) [0x7f145612f594]
  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t 
const&)+0x3b2) [0x7f1455e030b2]
  7: (PastIntervals::check_new_interval(int, int, std::vector<int, 
std::allocator<int> > const&, std::vector<int, std::allocator<int> > 
const&, int, int, std::vector<int, std::allocator<int> > const&, 
std::vector<int, std::allocator<int> > const&, unsigned int, unsigned 
int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, 
IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) 
[0x7f1455de8ab0]
  8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
  9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
  10: (OSD::init()+0x2179) [0x7f1455bd7779]
  11: (main()+0x2def) [0x7f1455add56f]
  12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
  13: (()+0x4ac8a6) [0x7f1455b7b8a6]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 1 lockdep
    0/ 1 context
    1/ 1 crush
    1/ 5 mds
    1/ 5 mds_balancer
    1/ 5 mds_locker
    1/ 5 mds_log
    1/ 5 mds_log_expire
    1/ 5 mds_migrator
    0/ 1 buffer
    0/ 1 timer
    0/ 1 filer
    0/ 1 striper
    0/ 1 objecter
    0/ 5 rados
    0/ 5 rbd
    0/ 5 rbd_mirror
    0/ 5 rbd_replay
    0/ 5 journaler
    0/ 5 objectcacher
    0/ 5 client
   20/20 osd
    0/ 5 optracker
    0/ 5 objclass
    1/ 3 filestore
    1/ 3 journal
    0/ 5 ms
    1/ 5 mon
    0/10 monc
    1/ 5 paxos
    0/ 5 tp
    1/ 5 auth
    1/ 5 crypto
    1/ 1 finisher
    1/ 5 heartbeatmap
    1/ 5 perfcounter
    1/ 5 rgw
    1/10 civetweb
    1/ 5 javaclient
    1/ 5 asok
    1/ 1 throttle
    0/ 0 refs
    1/ 5 xio
    1/ 5 compressor
    1/ 5 bluestore
    1/ 5 bluefs
    1/ 3 bdev
    1/ 5 kstore
    4/ 5 rocksdb
    4/ 5 leveldb
    4/ 5 memdb
    1/ 5 kinetic
    1/ 5 fuse
    1/ 5 mgr
    1/ 5 mgrc
    1/ 5 dpdk
    1/ 5 eventtrace
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent     10000
   max_new         1000
   log_file /var/log/ceph/ceph-osd.299.log
--- end dump of recent events ---


ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ --op info 
--pgid 19.f1e
{
     "pgid": "19.f1e",
     "last_update": "0'0",
     "last_complete": "0'0",
     "log_tail": "0'0",
     "last_user_version": 0,
     "last_backfill": "MAX",
     "last_backfill_bitwise": 0,
     "purged_snaps": [],
     "history": {
         "epoch_created": 1084817,
         "epoch_pool_created": 1084817,
         "last_epoch_started": 1085232,
         "last_interval_started": 1085230,
         "last_epoch_clean": 1050342,
         "last_interval_clean": 1050342,
         "last_epoch_split": 0,
         "last_epoch_marked_full": 1061015,
         "same_up_since": 1085230,
         "same_interval_since": 1085230,
         "same_primary_since": 1085230,
         "last_scrub": "960114'865853",
         "last_scrub_stamp": "2017-08-25 17:32:06.181006",
         "last_deep_scrub": "952725'861179",
         "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
         "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006"
     },
     "stats": {
         "version": "0'0",
         "reported_seq": "424",
         "reported_epoch": "1085650",
         "state": "active+undersized+degraded",
         "last_fresh": "2017-08-25 18:52:46.520078",
         "last_change": "2017-08-25 18:38:16.356266",
         "last_active": "2017-08-25 18:52:46.520078",
         "last_peered": "2017-08-25 18:52:46.520078",
         "last_clean": "2017-08-25 17:32:06.181006",
         "last_became_active": "2017-08-25 18:38:16.356266",
         "last_became_peered": "2017-08-25 18:38:16.356266",
         "last_unstale": "2017-08-25 18:52:46.520078",
         "last_undegraded": "2017-08-25 18:38:16.304877",
         "last_fullsized": "2017-08-25 18:38:16.304877",
         "mapping_epoch": 1085230,
         "log_start": "0'0",
         "ondisk_log_start": "0'0",
         "created": 1084817,
         "last_epoch_clean": 1050342,
         "parent": "0.0",
         "parent_split_bits": 0,
         "last_scrub": "960114'865853",
         "last_scrub_stamp": "2017-08-25 17:32:06.181006",
         "last_deep_scrub": "952725'861179",
         "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
         "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006",
         "log_size": 0,
         "ondisk_log_size": 0,
         "stats_invalid": false,
         "dirty_stats_invalid": false,
         "omap_stats_invalid": false,
         "hitset_stats_invalid": false,
         "hitset_bytes_stats_invalid": false,
         "pin_stats_invalid": false,
         "stat_sum": {
             "num_bytes": 0,
             "num_objects": 0,
             "num_object_clones": 0,
             "num_object_copies": 0,
             "num_objects_missing_on_primary": 0,
             "num_objects_missing": 0,
             "num_objects_degraded": 0,
             "num_objects_misplaced": 0,
             "num_objects_unfound": 0,
             "num_objects_dirty": 0,
             "num_whiteouts": 0,
             "num_read": 0,
             "num_read_kb": 0,
             "num_write": 0,
             "num_write_kb": 0,
             "num_scrub_errors": 0,
             "num_shallow_scrub_errors": 0,
             "num_deep_scrub_errors": 0,
             "num_objects_recovered": 0,
             "num_bytes_recovered": 0,
             "num_keys_recovered": 0,
             "num_objects_omap": 0,
             "num_objects_hit_set_archive": 0,
             "num_bytes_hit_set_archive": 0,
             "num_flush": 0,
             "num_flush_kb": 0,
             "num_evict": 0,
             "num_evict_kb": 0,
             "num_promote": 0,
             "num_flush_mode_high": 0,
             "num_flush_mode_low": 0,
             "num_evict_mode_some": 0,
             "num_evict_mode_full": 0,
             "num_objects_pinned": 0,
             "num_legacy_snapsets": 0
         },
         "up": [
             299
         ],
         "acting": [
             299
         ],
         "blocked_by": [],
         "up_primary": 299,
         "acting_primary": 299
     },
     "empty": 1,
     "dne": 0,
     "incomplete": 0,
     "last_epoch_started": 1085232,
     "hit_set_history": {
         "current_last_update": "0'0",
         "history": []
     }
}


ll /var/lib/ceph/osd/ceph-299/current/19.f1e_head/
total 0
-rw-r--r-- 1 root root 0 Aug 25 18:38 __head_00000F1E__13

ceph pg 19.f1e query
.
.
             "blocked": "peering is blocked due to down osds",
             "down_osds_we_would_probe": [
                 299
             ],
             "peering_blocked_by": [
                 {
                     "osd": 299,
                     "current_lost_at": 0,
                     "comment": "starting or marking this osd lost may 
let us proceed"
                 }
             ]
.
.
.



Removing the PG with ceph-objectstore-tool did the trick (roughly the 
invocation sketched below), but I am not sure whether the same thing can 
happen to a PG with real data in it.
Should I report this in the bug tracker?
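
For reference, the kind of invocation involved (with osd.299 stopped; 
exporting first is the precaution Sage suggested, and the backup path is 
just an example):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299 \
    --pgid 19.f1e --op export --file /root/19.f1e.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299 \
    --pgid 19.f1e --op remove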

thanks

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-25 22:25                 ` Linux Chips
@ 2017-08-25 22:46                   ` Sage Weil
  2017-08-25 22:49                     ` Sage Weil
  2017-08-25 23:03                     ` Linux Chips
  0 siblings, 2 replies; 27+ messages in thread
From: Sage Weil @ 2017-08-25 22:46 UTC (permalink / raw)
  To: Linux Chips; +Cc: Mustafa Muhammad, ceph-devel

On Sat, 26 Aug 2017, Linux Chips wrote:
> On 08/24/2017 06:58 AM, Sage Weil wrote:
> > 
> > Okay, so I think the combination of (1) removing empty PGs and (2) pruning
> > past_intervals will help.  (1) can be scripted by looking in
> > current/$pg_HEAD directories and picking out the ones with 0 or 1 objects
> > in them, doing ceph-objectstore-tool export to make a backup (just in
> > case), and then removing them (with ceph-objectstore-tool).  Be careful of
> > PGs for empty pools since those will be naturally empty (and you want
> > to keep them).
> > 
> > For (2), see the wip-prune-past-intervals-jewel branch in ceph-ci.git.. if
> > that is applied to the kraken branch it ought ot work (although it's
> > untested).  Alternatively, you can just upgrade to luminous, as it
> > implements a more sophisticated version of the same thing.  You need to
> > upgrade mons, mark all osds down, upgrade osds and start at least one of
> > them, and then set 'ceph osd require-osd-release luminous' before it'll
> > switch to the new past intervals representation.  Definitely test it on
> > your test cluster to ensure it reduces the memory usage!
> > 
> > If that doesn't sort things out we'll need to see a heap profile for an
> > OOMing OSD to make sure we know what is using all of the RAM...
> > 
> > sage
> > 
> 
> well, big thank you sage. we tested the upgrade on the test cluster. and it
> did work in an awesome way. and we did not even needed to remove any pg with
> the tool.

OMG what a relief!  This is great news for a Friday night!  :)

> I read about the prune thing in the release notes, so when our attempts failed
> to start the cluster, we tried upgrading, but it did not help. it turned out
> that we missed the 'ceph osd require-osd-release luminous' thing. I mean we
> was looking on the command in the release notes upgrade section, and said to
> each other "it dose not matter, it would only restrict the old osds from
> joining" and moved on. damn, we would be up a week ago.
> having said that, I think the release notes should highlight this in the
> future.

Good point--I'll update the wording to make it clear that a lot of new 
behavior does not kick in until the switch is flipped.

> now we have upgraded the production cluster, and it is up and running now,
> memory foot print was down to the tenth. the largest ram using osd i saw was
> about 6.5GB.
> but we faced some issues, particularly OSDs crashing with "FAILED
> assert(interval.last > last)"

Just to clarify: this production cluster never ran the hacky kraken patch 
I posted that prunes past intervals, right?  If so, then yes, please open a 
ticket with any/all osd.299 bugs you still have.  If it ran the kraken 
patch, then let's not bother--I don't want to confuse the situation with 
logs from that weird code.

Anyway, I'm delighted you're back up (and I'm sorry the release notes 
wording didn't help lead you there a week ago)!

FWIW the 6.5GB still sounds very high; I'd confirm that after the cluster 
stabilizes a restarted OSD shrinks back down to a more typical size (I 
suspect the allocator isn't releasing memory back to the OS due to 
fragmentation etc).
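
(If it is mostly the allocator sitting on freed pages, something like

  ceph tell osd.299 heap release      # example id

should hand memory back to the OS without a restart -- assuming the OSDs 
are built with tcmalloc.)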

Thanks, and enjoy the weekend!
sage


> 
> 
> logs:
> 
>    -34> 2017-08-26 00:38:00.505114 7f14556b4d00  0 osd.299 1085665 load_pgs
> opened 455 pgs
>    -33> 2017-08-26 00:38:00.505787 7f14556b4d00 10 osd.299 1085665 19.f1e
> needs 1050342-1085230
>    -32> 2017-08-26 00:38:00.505814 7f14556b4d00  1 osd.299 1085665
> build_past_intervals_parallel over 1050342-1085230
>    -31> 2017-08-26 00:38:00.505818 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050342
>    -30> 2017-08-26 00:38:00.505824 7f14556b4d00 20 osd.299 0 get_map 1050342 -
> loading and decoding 0x7f14b3dfb0c0
>    -29> 2017-08-26 00:38:00.506245 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050342 780781 bytes
>    -28> 2017-08-26 00:38:00.508539 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050342 pg 19.f1e first map, acting [80]
> up [80], same_interval_since = 1050342
>    -27> 2017-08-26 00:38:00.508547 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050343
>    -26> 2017-08-26 00:38:00.508550 7f14556b4d00 20 osd.299 0 get_map 1050343 -
> loading and decoding 0x7f14b3dfad80
>    -25> 2017-08-26 00:38:00.508997 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050343 781371 bytes
>    -24> 2017-08-26 00:38:00.511176 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050344
>    -23> 2017-08-26 00:38:00.511196 7f14556b4d00 20 osd.299 0 get_map 1050344 -
> loading and decoding 0x7f14b3dfb740
>    -22> 2017-08-26 00:38:00.511625 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050344 782446 bytes
>    -21> 2017-08-26 00:38:00.513813 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050345
>    -20> 2017-08-26 00:38:00.513820 7f14556b4d00 20 osd.299 0 get_map 1050345 -
> loading and decoding 0x7f14b3dfba80
>    -19> 2017-08-26 00:38:00.514260 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050345 782071 bytes
>    -18> 2017-08-26 00:38:00.516463 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050346
>    -17> 2017-08-26 00:38:00.516488 7f14556b4d00 20 osd.299 0 get_map 1050346 -
> loading and decoding 0x7f14b79c4000
>    -16> 2017-08-26 00:38:00.516927 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050346 781955 bytes
>    -15> 2017-08-26 00:38:00.519047 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050347
>    -14> 2017-08-26 00:38:00.519054 7f14556b4d00 20 osd.299 0 get_map 1050347 -
> loading and decoding 0x7f14b79c4340
>    -13> 2017-08-26 00:38:00.519500 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050347 781930 bytes
>    -12> 2017-08-26 00:38:00.521612 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050348
>    -11> 2017-08-26 00:38:00.521619 7f14556b4d00 20 osd.299 0 get_map 1050348 -
> loading and decoding 0x7f14b79c4680
>    -10> 2017-08-26 00:38:00.522074 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050348 784883 bytes
>     -9> 2017-08-26 00:38:00.524245 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050349
>     -8> 2017-08-26 00:38:00.524252 7f14556b4d00 20 osd.299 0 get_map 1050349 -
> loading and decoding 0x7f14b79c49c0
>     -7> 2017-08-26 00:38:00.524706 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050349 785081 bytes
>     -6> 2017-08-26 00:38:00.526854 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050350
>     -5> 2017-08-26 00:38:00.526861 7f14556b4d00 20 osd.299 0 get_map 1050350 -
> loading and decoding 0x7f14b79c4d00
>     -4> 2017-08-26 00:38:00.527330 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050350 785948 bytes
>     -3> 2017-08-26 00:38:00.529505 7f14556b4d00 10 osd.299 1085665
> build_past_intervals_parallel epoch 1050351
>     -2> 2017-08-26 00:38:00.529512 7f14556b4d00 20 osd.299 0 get_map 1050351 -
> loading and decoding 0x7f14b79c5040
>     -1> 2017-08-26 00:38:00.529979 7f14556b4d00 10 osd.299 0 add_map_bl
> 1050351 788650 bytes
>      0> 2017-08-26 00:38:00.534373 7f14556b4d00 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc:
> In function 'virtual void pi_compact_rep::add_interval(bool, const
> PastIntervals::pg_interval_t&)' thread 7f14556b4d00 time 2017-08-26
> 00:38:00.532119
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc:
> 3205: FAILED assert(interval.last > last)
> 
>  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x110) [0x7f145612f420]
>  2: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
> const&)+0x3b2) [0x7f1455e030b2]
>  3: (PastIntervals::check_new_interval(int, int, std::vector<int,
> std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&,
> int, int, std::vector<int, std::allocator<int> > const&, std::vector<int,
> std::allocator<int> > const&, unsigned int, unsigned int,
> std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
> [0x7f1455de8ab0]
>  4: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>  5: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>  6: (OSD::init()+0x2179) [0x7f1455bd7779]
>  7: (main()+0x2def) [0x7f1455add56f]
>  8: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>  9: (()+0x4ac8a6) [0x7f1455b7b8a6]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
> 
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>   20/20 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-osd.299.log
> --- end dump of recent events ---
> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
>  in thread 7f14556b4d00 thread_name:ceph-osd
> 
>  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>  1: (()+0xa21a01) [0x7f14560f0a01]
>  2: (()+0xf370) [0x7f1452cfe370]
>  3: (gsignal()+0x37) [0x7f1451d281d7]
>  4: (abort()+0x148) [0x7f1451d298c8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x284) [0x7f145612f594]
>  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
> const&)+0x3b2) [0x7f1455e030b2]
>  7: (PastIntervals::check_new_interval(int, int, std::vector<int,
> std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&,
> int, int, std::vector<int, std::allocator<int> > const&, std::vector<int,
> std::allocator<int> > const&, unsigned int, unsigned int,
> std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
> [0x7f1455de8ab0]
>  8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>  9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>  10: (OSD::init()+0x2179) [0x7f1455bd7779]
>  11: (main()+0x2def) [0x7f1455add56f]
>  12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>  13: (()+0x4ac8a6) [0x7f1455b7b8a6]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
> 
> --- begin dump of recent events ---
>      0> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted)
> **
>  in thread 7f14556b4d00 thread_name:ceph-osd
> 
>  ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>  1: (()+0xa21a01) [0x7f14560f0a01]
>  2: (()+0xf370) [0x7f1452cfe370]
>  3: (gsignal()+0x37) [0x7f1451d281d7]
>  4: (abort()+0x148) [0x7f1451d298c8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x284) [0x7f145612f594]
>  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
> const&)+0x3b2) [0x7f1455e030b2]
>  7: (PastIntervals::check_new_interval(int, int, std::vector<int,
> std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&,
> int, int, std::vector<int, std::allocator<int> > const&, std::vector<int,
> std::allocator<int> > const&, unsigned int, unsigned int,
> std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
> [0x7f1455de8ab0]
>  8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>  9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>  10: (OSD::init()+0x2179) [0x7f1455bd7779]
>  11: (main()+0x2def) [0x7f1455add56f]
>  12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>  13: (()+0x4ac8a6) [0x7f1455b7b8a6]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
> 
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>   20/20 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-osd.299.log
> --- end dump of recent events ---
> 
> 
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ --op info --pgid
> 19.f1e
> {
>     "pgid": "19.f1e",
>     "last_update": "0'0",
>     "last_complete": "0'0",
>     "log_tail": "0'0",
>     "last_user_version": 0,
>     "last_backfill": "MAX",
>     "last_backfill_bitwise": 0,
>     "purged_snaps": [],
>     "history": {
>         "epoch_created": 1084817,
>         "epoch_pool_created": 1084817,
>         "last_epoch_started": 1085232,
>         "last_interval_started": 1085230,
>         "last_epoch_clean": 1050342,
>         "last_interval_clean": 1050342,
>         "last_epoch_split": 0,
>         "last_epoch_marked_full": 1061015,
>         "same_up_since": 1085230,
>         "same_interval_since": 1085230,
>         "same_primary_since": 1085230,
>         "last_scrub": "960114'865853",
>         "last_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_deep_scrub": "952725'861179",
>         "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006"
>     },
>     "stats": {
>         "version": "0'0",
>         "reported_seq": "424",
>         "reported_epoch": "1085650",
>         "state": "active+undersized+degraded",
>         "last_fresh": "2017-08-25 18:52:46.520078",
>         "last_change": "2017-08-25 18:38:16.356266",
>         "last_active": "2017-08-25 18:52:46.520078",
>         "last_peered": "2017-08-25 18:52:46.520078",
>         "last_clean": "2017-08-25 17:32:06.181006",
>         "last_became_active": "2017-08-25 18:38:16.356266",
>         "last_became_peered": "2017-08-25 18:38:16.356266",
>         "last_unstale": "2017-08-25 18:52:46.520078",
>         "last_undegraded": "2017-08-25 18:38:16.304877",
>         "last_fullsized": "2017-08-25 18:38:16.304877",
>         "mapping_epoch": 1085230,
>         "log_start": "0'0",
>         "ondisk_log_start": "0'0",
>         "created": 1084817,
>         "last_epoch_clean": 1050342,
>         "parent": "0.0",
>         "parent_split_bits": 0,
>         "last_scrub": "960114'865853",
>         "last_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_deep_scrub": "952725'861179",
>         "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006",
>         "log_size": 0,
>         "ondisk_log_size": 0,
>         "stats_invalid": false,
>         "dirty_stats_invalid": false,
>         "omap_stats_invalid": false,
>         "hitset_stats_invalid": false,
>         "hitset_bytes_stats_invalid": false,
>         "pin_stats_invalid": false,
>         "stat_sum": {
>             "num_bytes": 0,
>             "num_objects": 0,
>             "num_object_clones": 0,
>             "num_object_copies": 0,
>             "num_objects_missing_on_primary": 0,
>             "num_objects_missing": 0,
>             "num_objects_degraded": 0,
>             "num_objects_misplaced": 0,
>             "num_objects_unfound": 0,
>             "num_objects_dirty": 0,
>             "num_whiteouts": 0,
>             "num_read": 0,
>             "num_read_kb": 0,
>             "num_write": 0,
>             "num_write_kb": 0,
>             "num_scrub_errors": 0,
>             "num_shallow_scrub_errors": 0,
>             "num_deep_scrub_errors": 0,
>             "num_objects_recovered": 0,
>             "num_bytes_recovered": 0,
>             "num_keys_recovered": 0,
>             "num_objects_omap": 0,
>             "num_objects_hit_set_archive": 0,
>             "num_bytes_hit_set_archive": 0,
>             "num_flush": 0,
>             "num_flush_kb": 0,
>             "num_evict": 0,
>             "num_evict_kb": 0,
>             "num_promote": 0,
>             "num_flush_mode_high": 0,
>             "num_flush_mode_low": 0,
>             "num_evict_mode_some": 0,
>             "num_evict_mode_full": 0,
>             "num_objects_pinned": 0,
>             "num_legacy_snapsets": 0
>         },
>         "up": [
>             299
>         ],
>         "acting": [
>             299
>         ],
>         "blocked_by": [],
>         "up_primary": 299,
>         "acting_primary": 299
>     },
>     "empty": 1,
>     "dne": 0,
>     "incomplete": 0,
>     "last_epoch_started": 1085232,
>     "hit_set_history": {
>         "current_last_update": "0'0",
>         "history": []
>     }
> }
> 
> 
> ll /var/lib/ceph/osd/ceph-299/current/19.f1e_head/
> total 0
> -rw-r--r-- 1 root root 0 Aug 25 18:38 __head_00000F1E__13
> 
> ceph pg 19.f1e query
> .
> .
>             "blocked": "peering is blocked due to down osds",
>             "down_osds_we_would_probe": [
>                 299
>             ],
>             "peering_blocked_by": [
>                 {
>                     "osd": 299,
>                     "current_lost_at": 0,
>                     "comment": "starting or marking this osd lost may let us
> proceed"
>                 }
>             ]
> .
> .
> .
> 
> 
> 
> removing the pg with ceph-objectstore-tool did the trick. but i am not sure if
> that will happen to a pg with real data in it or not.
> should i report this in the bug tracker?
> 
> thanks
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-25 22:46                   ` Sage Weil
@ 2017-08-25 22:49                     ` Sage Weil
  2017-08-25 23:03                       ` Linux Chips
  2017-08-25 23:03                     ` Linux Chips
  1 sibling, 1 reply; 27+ messages in thread
From: Sage Weil @ 2017-08-25 22:49 UTC (permalink / raw)
  To: Linux Chips; +Cc: Mustafa Muhammad, ceph-devel

On Fri, 25 Aug 2017, Sage Weil wrote:
> On Sat, 26 Aug 2017, Linux Chips wrote:
> > I read about the prune thing in the release notes, so when our attempts failed
> > to start the cluster, we tried upgrading, but it did not help. it turned out
> > that we missed the 'ceph osd require-osd-release luminous' thing. I mean we
> > were looking at the command in the release notes upgrade section, and said to
> > each other "it does not matter, it would only restrict the old osds from
> > joining" and moved on. damn, we would have been up a week ago.
> > having said that, I think the release notes should highlight this in the
> > future.
> 
> Good point--I'll update the wording to make it clear that a lot of new 
> behavior does not kick in until the switch is flipped.

Is this better?

	https://github.com/ceph/ceph/pull/17270

Thanks!
sage

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-25 22:46                   ` Sage Weil
  2017-08-25 22:49                     ` Sage Weil
@ 2017-08-25 23:03                     ` Linux Chips
  2017-08-25 23:08                       ` Sage Weil
  1 sibling, 1 reply; 27+ messages in thread
From: Linux Chips @ 2017-08-25 23:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mustafa Muhammad, ceph-devel

On 08/26/2017 01:46 AM, Sage Weil wrote:
> On Sat, 26 Aug 2017, Linux Chips wrote:
>> On 08/24/2017 06:58 AM, Sage Weil wrote:
>>>
>>> Okay, so I think the combination of (1) removing empty PGs and (2) pruning
>>> past_intervals will help.  (1) can be scripted by looking in
>>> current/$pg_HEAD directories and picking out the ones with 0 or 1 objects
>>> in them, doing ceph-objecstore-tool export to make a backup (just in
>>> case), and then removing them (with ceph-objectstore-tool).  Be careful of
>>> PGs for empty pools since those will be naturally empty (and you want
>>> to keep them).
>>>
>>> For (2), see the wip-prune-past-intervals-jewel branch in ceph-ci.git.. if
>>> that is applied to the kraken branch it ought ot work (although it's
>>> untested).  Alternatively, you can just upgrade to luminous, as it
>>> implements a more sophisticated version of the same thing.  You need to
>>> upgrade mons, mark all osds down, upgrade osds and start at least one of
>>> them, and then set 'ceph osd require-osd-release luminous' before it'll
>>> switch to the new past intervals representation.  Definitely test it on
>>> your test cluster to ensure it reduces the memory usage!
>>>
>>> If that doesn't sort things out we'll need to see a heap profile for an
>>> OOMing OSD to make sure we know what is using all of the RAM...
>>>
>>> sage
>>>
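
(for the record, a rough sketch of how we would script (1) above -- untested, 
assumes the filestore current/ layout, must be run with the osd stopped, and 
be careful to skip PGs that belong to genuinely empty pools:)

OSD=/var/lib/ceph/osd/ceph-299            # example osd, stopped
mkdir -p /root/pg-backups
for d in "$OSD"/current/*_head; do
    pgid=$(basename "$d" _head)
    files=$(find "$d" -type f | wc -l)    # the __head_* marker counts as one file
    [ "$files" -le 2 ] || continue        # i.e. 0 or 1 objects in the PG
    ceph-objectstore-tool --data-path "$OSD" --pgid "$pgid" \
        --op export --file /root/pg-backups/"$pgid".export
    ceph-objectstore-tool --data-path "$OSD" --pgid "$pgid" --op remove   # newer releases may want --force
done
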
>>
>> well, big thank you sage. we tested the upgrade on the test cluster. and it
>> did work in an awesome way. and we did not even need to remove any pg with
>> the tool.
> 
> OMG what a relief!  This is great news for a Friday night!  :)
> 
>> I read about the prune thing in the release notes, so when our attempts failed
>> to start the cluster, we tried upgrading, but it did not help. it turned out
>> that we missed the 'ceph osd require-osd-release luminous' thing. I mean we
>> was looking on the command in the release notes upgrade section, and said to
>> each other "it dose not matter, it would only restrict the old osds from
>> joining" and moved on. damn, we would be up a week ago.
>> having said that, I think the release notes should highlight this in the
>> future.
> 
> Good point--I'll update the wording to make it clear that a lot of new
> behavior does not kick in until the switch is flipped.
> 
>> now we have upgraded the production cluster, and it is up and running now,
>> memory foot print was down to the tenth. the largest ram using osd i saw was
>> about 6.5GB.
>> but we faced some issues, particularly OSDs crashing with "FAILED
>> assert(interval.last > last)"
> 
> Just to clarify: this production cluster never ran the hacky kraken patch
> I posted that prune past intervals, right?  If so, then yes, please open a
> ticket with any/all osd.299 bugs you still have.  If it ran the kraken
> patch, then let's not bother--I don't want to confuse the situation with
> logs from that weird code.

no, we did not run any patched versions; this cluster only ran on the
packages from the official repo.
is there any more info we should get into the bug report before i clean the osds up?
> 
> Anyway, I'm delighted you're back up (and I'm sorry the release notes
> wording didn't help lead you there a week ago)!
> 
> FWIW the 6.5gb still sounds very high; I'd confirm that after the cluster
> stabilizes a restarted OSD shrinks back down to a more typical size (I
> suspect the allocator isn't releasing memory back to the OS due to
> fragmentation etc).
to be clear here, the 6.5GB was during the peering operation. the usage got
lower after a few minutes; they are now at about 2-2.5GB. sometimes they
spike to 4GB, i think when starting a new recovery or something.
another thing we noticed: in kraken (and jewel), if noup is set, a
starting osd will consume about 3GB of RAM by the time it is waiting for
the noup flag to be removed, while in luminous it was about 700-800 MB.

> 
> Thanks, and enjoy the weekend!
> sage
> 
> 
>>
>>
>> logs:
>>
>>     -34> 2017-08-26 00:38:00.505114 7f14556b4d00  0 osd.299 1085665 load_pgs
>> opened 455 pgs
>>     -33> 2017-08-26 00:38:00.505787 7f14556b4d00 10 osd.299 1085665 19.f1e
>> needs 1050342-1085230
>>     -32> 2017-08-26 00:38:00.505814 7f14556b4d00  1 osd.299 1085665
>> build_past_intervals_parallel over 1050342-1085230
>>     -31> 2017-08-26 00:38:00.505818 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050342
>>     -30> 2017-08-26 00:38:00.505824 7f14556b4d00 20 osd.299 0 get_map 1050342 -
>> loading and decoding 0x7f14b3dfb0c0
>>     -29> 2017-08-26 00:38:00.506245 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050342 780781 bytes
>>     -28> 2017-08-26 00:38:00.508539 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050342 pg 19.f1e first map, acting [80]
>> up [80], same_interval_since = 1050342
>>     -27> 2017-08-26 00:38:00.508547 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050343
>>     -26> 2017-08-26 00:38:00.508550 7f14556b4d00 20 osd.299 0 get_map 1050343 -
>> loading and decoding 0x7f14b3dfad80
>>     -25> 2017-08-26 00:38:00.508997 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050343 781371 bytes
>>     -24> 2017-08-26 00:38:00.511176 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050344
>>     -23> 2017-08-26 00:38:00.511196 7f14556b4d00 20 osd.299 0 get_map 1050344 -
>> loading and decoding 0x7f14b3dfb740
>>     -22> 2017-08-26 00:38:00.511625 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050344 782446 bytes
>>     -21> 2017-08-26 00:38:00.513813 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050345
>>     -20> 2017-08-26 00:38:00.513820 7f14556b4d00 20 osd.299 0 get_map 1050345 -
>> loading and decoding 0x7f14b3dfba80
>>     -19> 2017-08-26 00:38:00.514260 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050345 782071 bytes
>>     -18> 2017-08-26 00:38:00.516463 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050346
>>     -17> 2017-08-26 00:38:00.516488 7f14556b4d00 20 osd.299 0 get_map 1050346 -
>> loading and decoding 0x7f14b79c4000
>>     -16> 2017-08-26 00:38:00.516927 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050346 781955 bytes
>>     -15> 2017-08-26 00:38:00.519047 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050347
>>     -14> 2017-08-26 00:38:00.519054 7f14556b4d00 20 osd.299 0 get_map 1050347 -
>> loading and decoding 0x7f14b79c4340
>>     -13> 2017-08-26 00:38:00.519500 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050347 781930 bytes
>>     -12> 2017-08-26 00:38:00.521612 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050348
>>     -11> 2017-08-26 00:38:00.521619 7f14556b4d00 20 osd.299 0 get_map 1050348 -
>> loading and decoding 0x7f14b79c4680
>>     -10> 2017-08-26 00:38:00.522074 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050348 784883 bytes
>>      -9> 2017-08-26 00:38:00.524245 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050349
>>      -8> 2017-08-26 00:38:00.524252 7f14556b4d00 20 osd.299 0 get_map 1050349 -
>> loading and decoding 0x7f14b79c49c0
>>      -7> 2017-08-26 00:38:00.524706 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050349 785081 bytes
>>      -6> 2017-08-26 00:38:00.526854 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050350
>>      -5> 2017-08-26 00:38:00.526861 7f14556b4d00 20 osd.299 0 get_map 1050350 -
>> loading and decoding 0x7f14b79c4d00
>>      -4> 2017-08-26 00:38:00.527330 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050350 785948 bytes
>>      -3> 2017-08-26 00:38:00.529505 7f14556b4d00 10 osd.299 1085665
>> build_past_intervals_parallel epoch 1050351
>>      -2> 2017-08-26 00:38:00.529512 7f14556b4d00 20 osd.299 0 get_map 1050351 -
>> loading and decoding 0x7f14b79c5040
>>      -1> 2017-08-26 00:38:00.529979 7f14556b4d00 10 osd.299 0 add_map_bl
>> 1050351 788650 bytes
>>       0> 2017-08-26 00:38:00.534373 7f14556b4d00 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc:
>> In function 'virtual void pi_compact_rep::add_interval(bool, const
>> PastIntervals::pg_interval_t&)' thread 7f14556b4d00 time 2017-08-26
>> 00:38:00.532119
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc:
>> 3205: FAILED assert(interval.last > last)
>>
>>   ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x110) [0x7f145612f420]
>>   2: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
>> const&)+0x3b2) [0x7f1455e030b2]
>>   3: (PastIntervals::check_new_interval(int, int, std::vector<int,
>> std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&,
>> int, int, std::vector<int, std::allocator<int> > const&, std::vector<int,
>> std::allocator<int> > const&, unsigned int, unsigned int,
>> std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
>> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
>> [0x7f1455de8ab0]
>>   4: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>>   5: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>>   6: (OSD::init()+0x2179) [0x7f1455bd7779]
>>   7: (main()+0x2def) [0x7f1455add56f]
>>   8: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>>   9: (()+0x4ac8a6) [0x7f1455b7b8a6]
>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>> --- logging levels ---
>>     0/ 5 none
>>     0/ 1 lockdep
>>     0/ 1 context
>>     1/ 1 crush
>>     1/ 5 mds
>>     1/ 5 mds_balancer
>>     1/ 5 mds_locker
>>     1/ 5 mds_log
>>     1/ 5 mds_log_expire
>>     1/ 5 mds_migrator
>>     0/ 1 buffer
>>     0/ 1 timer
>>     0/ 1 filer
>>     0/ 1 striper
>>     0/ 1 objecter
>>     0/ 5 rados
>>     0/ 5 rbd
>>     0/ 5 rbd_mirror
>>     0/ 5 rbd_replay
>>     0/ 5 journaler
>>     0/ 5 objectcacher
>>     0/ 5 client
>>    20/20 osd
>>     0/ 5 optracker
>>     0/ 5 objclass
>>     1/ 3 filestore
>>     1/ 3 journal
>>     0/ 5 ms
>>     1/ 5 mon
>>     0/10 monc
>>     1/ 5 paxos
>>     0/ 5 tp
>>     1/ 5 auth
>>     1/ 5 crypto
>>     1/ 1 finisher
>>     1/ 5 heartbeatmap
>>     1/ 5 perfcounter
>>     1/ 5 rgw
>>     1/10 civetweb
>>     1/ 5 javaclient
>>     1/ 5 asok
>>     1/ 1 throttle
>>     0/ 0 refs
>>     1/ 5 xio
>>     1/ 5 compressor
>>     1/ 5 bluestore
>>     1/ 5 bluefs
>>     1/ 3 bdev
>>     1/ 5 kstore
>>     4/ 5 rocksdb
>>     4/ 5 leveldb
>>     4/ 5 memdb
>>     1/ 5 kinetic
>>     1/ 5 fuse
>>     1/ 5 mgr
>>     1/ 5 mgrc
>>     1/ 5 dpdk
>>     1/ 5 eventtrace
>>    -2/-2 (syslog threshold)
>>    -1/-1 (stderr threshold)
>>    max_recent     10000
>>    max_new         1000
>>    log_file /var/log/ceph/ceph-osd.299.log
>> --- end dump of recent events ---
>> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
>>   in thread 7f14556b4d00 thread_name:ceph-osd
>>
>>   ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>>   1: (()+0xa21a01) [0x7f14560f0a01]
>>   2: (()+0xf370) [0x7f1452cfe370]
>>   3: (gsignal()+0x37) [0x7f1451d281d7]
>>   4: (abort()+0x148) [0x7f1451d298c8]
>>   5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x284) [0x7f145612f594]
>>   6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
>> const&)+0x3b2) [0x7f1455e030b2]
>>   7: (PastIntervals::check_new_interval(int, int, std::vector<int,
>> std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&,
>> int, int, std::vector<int, std::allocator<int> > const&, std::vector<int,
>> std::allocator<int> > const&, unsigned int, unsigned int,
>> std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
>> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
>> [0x7f1455de8ab0]
>>   8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>>   9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>>   10: (OSD::init()+0x2179) [0x7f1455bd7779]
>>   11: (main()+0x2def) [0x7f1455add56f]
>>   12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>>   13: (()+0x4ac8a6) [0x7f1455b7b8a6]
>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>> --- begin dump of recent events ---
>>       0> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted)
>> **
>>   in thread 7f14556b4d00 thread_name:ceph-osd
>>
>>   ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
>>   1: (()+0xa21a01) [0x7f14560f0a01]
>>   2: (()+0xf370) [0x7f1452cfe370]
>>   3: (gsignal()+0x37) [0x7f1451d281d7]
>>   4: (abort()+0x148) [0x7f1451d298c8]
>>   5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x284) [0x7f145612f594]
>>   6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
>> const&)+0x3b2) [0x7f1455e030b2]
>>   7: (PastIntervals::check_new_interval(int, int, std::vector<int,
>> std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&,
>> int, int, std::vector<int, std::allocator<int> > const&, std::vector<int,
>> std::allocator<int> > const&, unsigned int, unsigned int,
>> std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
>> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
>> [0x7f1455de8ab0]
>>   8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
>>   9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
>>   10: (OSD::init()+0x2179) [0x7f1455bd7779]
>>   11: (main()+0x2def) [0x7f1455add56f]
>>   12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
>>   13: (()+0x4ac8a6) [0x7f1455b7b8a6]
>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>> --- logging levels ---
>>     0/ 5 none
>>     0/ 1 lockdep
>>     0/ 1 context
>>     1/ 1 crush
>>     1/ 5 mds
>>     1/ 5 mds_balancer
>>     1/ 5 mds_locker
>>     1/ 5 mds_log
>>     1/ 5 mds_log_expire
>>     1/ 5 mds_migrator
>>     0/ 1 buffer
>>     0/ 1 timer
>>     0/ 1 filer
>>     0/ 1 striper
>>     0/ 1 objecter
>>     0/ 5 rados
>>     0/ 5 rbd
>>     0/ 5 rbd_mirror
>>     0/ 5 rbd_replay
>>     0/ 5 journaler
>>     0/ 5 objectcacher
>>     0/ 5 client
>>    20/20 osd
>>     0/ 5 optracker
>>     0/ 5 objclass
>>     1/ 3 filestore
>>     1/ 3 journal
>>     0/ 5 ms
>>     1/ 5 mon
>>     0/10 monc
>>     1/ 5 paxos
>>     0/ 5 tp
>>     1/ 5 auth
>>     1/ 5 crypto
>>     1/ 1 finisher
>>     1/ 5 heartbeatmap
>>     1/ 5 perfcounter
>>     1/ 5 rgw
>>     1/10 civetweb
>>     1/ 5 javaclient
>>     1/ 5 asok
>>     1/ 1 throttle
>>     0/ 0 refs
>>     1/ 5 xio
>>     1/ 5 compressor
>>     1/ 5 bluestore
>>     1/ 5 bluefs
>>     1/ 3 bdev
>>     1/ 5 kstore
>>     4/ 5 rocksdb
>>     4/ 5 leveldb
>>     4/ 5 memdb
>>     1/ 5 kinetic
>>     1/ 5 fuse
>>     1/ 5 mgr
>>     1/ 5 mgrc
>>     1/ 5 dpdk
>>     1/ 5 eventtrace
>>    -2/-2 (syslog threshold)
>>    -1/-1 (stderr threshold)
>>    max_recent     10000
>>    max_new         1000
>>    log_file /var/log/ceph/ceph-osd.299.log
>> --- end dump of recent events ---
>>
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ --op info --pgid
>> 19.f1e
>> {
>>      "pgid": "19.f1e",
>>      "last_update": "0'0",
>>      "last_complete": "0'0",
>>      "log_tail": "0'0",
>>      "last_user_version": 0,
>>      "last_backfill": "MAX",
>>      "last_backfill_bitwise": 0,
>>      "purged_snaps": [],
>>      "history": {
>>          "epoch_created": 1084817,
>>          "epoch_pool_created": 1084817,
>>          "last_epoch_started": 1085232,
>>          "last_interval_started": 1085230,
>>          "last_epoch_clean": 1050342,
>>          "last_interval_clean": 1050342,
>>          "last_epoch_split": 0,
>>          "last_epoch_marked_full": 1061015,
>>          "same_up_since": 1085230,
>>          "same_interval_since": 1085230,
>>          "same_primary_since": 1085230,
>>          "last_scrub": "960114'865853",
>>          "last_scrub_stamp": "2017-08-25 17:32:06.181006",
>>          "last_deep_scrub": "952725'861179",
>>          "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
>>          "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006"
>>      },
>>      "stats": {
>>          "version": "0'0",
>>          "reported_seq": "424",
>>          "reported_epoch": "1085650",
>>          "state": "active+undersized+degraded",
>>          "last_fresh": "2017-08-25 18:52:46.520078",
>>          "last_change": "2017-08-25 18:38:16.356266",
>>          "last_active": "2017-08-25 18:52:46.520078",
>>          "last_peered": "2017-08-25 18:52:46.520078",
>>          "last_clean": "2017-08-25 17:32:06.181006",
>>          "last_became_active": "2017-08-25 18:38:16.356266",
>>          "last_became_peered": "2017-08-25 18:38:16.356266",
>>          "last_unstale": "2017-08-25 18:52:46.520078",
>>          "last_undegraded": "2017-08-25 18:38:16.304877",
>>          "last_fullsized": "2017-08-25 18:38:16.304877",
>>          "mapping_epoch": 1085230,
>>          "log_start": "0'0",
>>          "ondisk_log_start": "0'0",
>>          "created": 1084817,
>>          "last_epoch_clean": 1050342,
>>          "parent": "0.0",
>>          "parent_split_bits": 0,
>>          "last_scrub": "960114'865853",
>>          "last_scrub_stamp": "2017-08-25 17:32:06.181006",
>>          "last_deep_scrub": "952725'861179",
>>          "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
>>          "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006",
>>          "log_size": 0,
>>          "ondisk_log_size": 0,
>>          "stats_invalid": false,
>>          "dirty_stats_invalid": false,
>>          "omap_stats_invalid": false,
>>          "hitset_stats_invalid": false,
>>          "hitset_bytes_stats_invalid": false,
>>          "pin_stats_invalid": false,
>>          "stat_sum": {
>>              "num_bytes": 0,
>>              "num_objects": 0,
>>              "num_object_clones": 0,
>>              "num_object_copies": 0,
>>              "num_objects_missing_on_primary": 0,
>>              "num_objects_missing": 0,
>>              "num_objects_degraded": 0,
>>              "num_objects_misplaced": 0,
>>              "num_objects_unfound": 0,
>>              "num_objects_dirty": 0,
>>              "num_whiteouts": 0,
>>              "num_read": 0,
>>              "num_read_kb": 0,
>>              "num_write": 0,
>>              "num_write_kb": 0,
>>              "num_scrub_errors": 0,
>>              "num_shallow_scrub_errors": 0,
>>              "num_deep_scrub_errors": 0,
>>              "num_objects_recovered": 0,
>>              "num_bytes_recovered": 0,
>>              "num_keys_recovered": 0,
>>              "num_objects_omap": 0,
>>              "num_objects_hit_set_archive": 0,
>>              "num_bytes_hit_set_archive": 0,
>>              "num_flush": 0,
>>              "num_flush_kb": 0,
>>              "num_evict": 0,
>>              "num_evict_kb": 0,
>>              "num_promote": 0,
>>              "num_flush_mode_high": 0,
>>              "num_flush_mode_low": 0,
>>              "num_evict_mode_some": 0,
>>              "num_evict_mode_full": 0,
>>              "num_objects_pinned": 0,
>>              "num_legacy_snapsets": 0
>>          },
>>          "up": [
>>              299
>>          ],
>>          "acting": [
>>              299
>>          ],
>>          "blocked_by": [],
>>          "up_primary": 299,
>>          "acting_primary": 299
>>      },
>>      "empty": 1,
>>      "dne": 0,
>>      "incomplete": 0,
>>      "last_epoch_started": 1085232,
>>      "hit_set_history": {
>>          "current_last_update": "0'0",
>>          "history": []
>>      }
>> }
>>
>>
>> ll /var/lib/ceph/osd/ceph-299/current/19.f1e_head/
>> total 0
>> -rw-r--r-- 1 root root 0 Aug 25 18:38 __head_00000F1E__13
>>
>> ceph pg 19.f1e query
>> .
>> .
>>              "blocked": "peering is blocked due to down osds",
>>              "down_osds_we_would_probe": [
>>                  299
>>              ],
>>              "peering_blocked_by": [
>>                  {
>>                      "osd": 299,
>>                      "current_lost_at": 0,
>>                      "comment": "starting or marking this osd lost may let us
>> proceed"
>>                  }
>>              ]
>> .
>> .
>> .
>>
>>
>>
>> removing the pg with ceph-objectstore-tool did the trick. but i am not sure if
>> that will happen to a pg with real data in it or not.
>> should i report this in the bug tracker?
>>
>> thanks
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-25 22:49                     ` Sage Weil
@ 2017-08-25 23:03                       ` Linux Chips
  0 siblings, 0 replies; 27+ messages in thread
From: Linux Chips @ 2017-08-25 23:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mustafa Muhammad, ceph-devel

On 08/26/2017 01:49 AM, Sage Weil wrote:
> On Fri, 25 Aug 2017, Sage Weil wrote:
>> On Sat, 26 Aug 2017, Linux Chips wrote:
>>> I read about the prune thing in the release notes, so when our attempts failed
>>> to start the cluster, we tried upgrading, but it did not help. it turned out
>>> that we missed the 'ceph osd require-osd-release luminous' thing. I mean we
>>> was looking on the command in the release notes upgrade section, and said to
>>> each other "it dose not matter, it would only restrict the old osds from
>>> joining" and moved on. damn, we would be up a week ago.
>>> having said that, I think the release notes should highlight this in the
>>> future.
>>
>> Good point--I'll update the wording to make it clear that a lot of new
>> behavior does not kick in until the switch is flipped.
> 
> Is this better?
> 
> 	https://github.com/ceph/ceph/pull/17270
> 
> Thanks!
> sage
> 
much better
thanks


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-25 23:03                     ` Linux Chips
@ 2017-08-25 23:08                       ` Sage Weil
  2017-08-26 12:13                         ` Linux Chips
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2017-08-25 23:08 UTC (permalink / raw)
  To: Linux Chips; +Cc: Mustafa Muhammad, ceph-devel

On Sat, 26 Aug 2017, Linux Chips wrote:
> On 08/26/2017 01:46 AM, Sage Weil wrote:
> > On Sat, 26 Aug 2017, Linux Chips wrote:
> > > On 08/24/2017 06:58 AM, Sage Weil wrote:
> > > > 
> > > > Okay, so I think the combination of (1) removing empty PGs and (2)
> > > > pruning
> > > > past_intervals will help.  (1) can be scripted by looking in
> > > > current/$pg_HEAD directories and picking out the ones with 0 or 1
> > > > objects
> > > > in them, doing ceph-objecstore-tool export to make a backup (just in
> > > > case), and then removing them (with ceph-objectstore-tool).  Be careful
> > > > of
> > > > PGs for empty pools since those will be naturally empty (and you want
> > > > to keep them).
> > > > 
> > > > For (2), see the wip-prune-past-intervals-jewel branch in ceph-ci.git..
> > > > if
> > > > that is applied to the kraken branch it ought ot work (although it's
> > > > untested).  Alternatively, you can just upgrade to luminous, as it
> > > > implements a more sophisticated version of the same thing.  You need to
> > > > upgrade mons, mark all osds down, upgrade osds and start at least one of
> > > > them, and then set 'ceph osd require-osd-release luminous' before it'll
> > > > switch to the new past intervals representation.  Definitely test it on
> > > > your test cluster to ensure it reduces the memory usage!
> > > > 
> > > > If that doesn't sort things out we'll need to see a heap profile for an
> > > > OOMing OSD to make sure we know what is using all of the RAM...
> > > > 
> > > > sage
> > > > 
> > > 
> > > well, big thank you sage. we tested the upgrade on the test cluster. and
> > > it
> > > did work in an awesome way. and we did not even needed to remove any pg
> > > with
> > > the tool.
> > 
> > OMG what a relief!  This is great news for a Friday night!  :)
> > 
> > > I read about the prune thing in the release notes, so when our attempts
> > > failed
> > > to start the cluster, we tried upgrading, but it did not help. it turned
> > > out
> > > that we missed the 'ceph osd require-osd-release luminous' thing. I mean
> > > we
> > > was looking on the command in the release notes upgrade section, and said
> > > to
> > > each other "it dose not matter, it would only restrict the old osds from
> > > joining" and moved on. damn, we would be up a week ago.
> > > having said that, I think the release notes should highlight this in the
> > > future.
> > 
> > Good point--I'll update the wording to make it clear that a lot of new
> > behavior does not kick in until the switch is flipped.
> > 
> > > now we have upgraded the production cluster, and it is up and running now,
> > > memory foot print was down to the tenth. the largest ram using osd i saw
> > > was
> > > about 6.5GB.
> > > but we faced some issues, particularly OSDs crashing with "FAILED
> > > assert(interval.last > last)"
> > 
> > Just to clarify: this production cluster never ran the hacky kraken patch
> > I posted that prune past intervals, right?  If so, then yes, please open a
> > ticket with any/all osd.299 bugs you still have.  If it ran the kraken
> > patch, then let's not bother--I don't want to confuse the situation with
> > logs from that weird code.
> 
> no, we did not run any patched versions; this cluster only ran on the
> packages from the official repo.
> is there any more info we should get into the bug report before i clean the osds up?

Hmm, if you can dump the osdmaps from the mon around the failure epoch,
1050351 (say, +/- 100 epochs to be safe?), that would help.  You can stick 
them in a directory with the logs and use ceph-post-file to upload the 
whole thing.
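
Something along these lines should do it (just a sketch -- adjust the range, 
and it assumes the mon hasn't trimmed those epochs yet):

mkdir -p /tmp/osd299-maps
for e in $(seq 1050251 1050451); do       # failure epoch 1050351 +/- 100
    ceph osd getmap $e -o /tmp/osd299-maps/osdmap.$e
done
cp /var/log/ceph/ceph-osd.299.log /tmp/osd299-maps/
ceph-post-file /tmp/osd299-maps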

> > Anyway, I'm delighted you're back up (and I'm sorry the release notes
> > wording didn't help lead you there a week ago)!
> > 
> > FWIW the 6.5gb still sounds very high; I'd confirm that after the cluster
> > stabilizes a restarted OSD shrinks back down to a more typical size (I
> > suspect the allocator isn't releasing memory back to the OS due to
> > fragmentation etc).
> to be clear here, the 6.5GB was during the peering operation. the usage got
> lower after a few minutes; they are now at about 2-2.5GB. sometimes they
> spike to 4GB, i think when starting a new recovery or something.
> another thing we noticed: in kraken (and jewel), if noup is set, a starting
> osd will consume about 3GB of RAM by the time it is waiting for the noup
> flag to be removed, while in luminous it was about 700-800 MB.

This is probably mostly due to the smaller default osdmap cache size, but 
it's good to see that it's effective.
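
If you're curious, you can see what a running osd actually has for its map 
cache via the admin socket on the osd's host, e.g. (osd.299 just as an 
example):

ceph daemon osd.299 config get osd_map_cache_size
ceph daemon osd.299 config show | grep osd_map_cache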

Thanks!
sage



> 
> > 
> > Thanks, and enjoy the weekend!
> > sage
> > 
> > 
> > > 
> > > 
> > > logs:
> > > 
> > >     -34> 2017-08-26 00:38:00.505114 7f14556b4d00  0 osd.299 1085665
> > > load_pgs
> > > opened 455 pgs
> > >     -33> 2017-08-26 00:38:00.505787 7f14556b4d00 10 osd.299 1085665 19.f1e
> > > needs 1050342-1085230
> > >     -32> 2017-08-26 00:38:00.505814 7f14556b4d00  1 osd.299 1085665
> > > build_past_intervals_parallel over 1050342-1085230
> > >     -31> 2017-08-26 00:38:00.505818 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050342
> > >     -30> 2017-08-26 00:38:00.505824 7f14556b4d00 20 osd.299 0 get_map
> > > 1050342 -
> > > loading and decoding 0x7f14b3dfb0c0
> > >     -29> 2017-08-26 00:38:00.506245 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050342 780781 bytes
> > >     -28> 2017-08-26 00:38:00.508539 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050342 pg 19.f1e first map, acting
> > > [80]
> > > up [80], same_interval_since = 1050342
> > >     -27> 2017-08-26 00:38:00.508547 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050343
> > >     -26> 2017-08-26 00:38:00.508550 7f14556b4d00 20 osd.299 0 get_map
> > > 1050343 -
> > > loading and decoding 0x7f14b3dfad80
> > >     -25> 2017-08-26 00:38:00.508997 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050343 781371 bytes
> > >     -24> 2017-08-26 00:38:00.511176 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050344
> > >     -23> 2017-08-26 00:38:00.511196 7f14556b4d00 20 osd.299 0 get_map
> > > 1050344 -
> > > loading and decoding 0x7f14b3dfb740
> > >     -22> 2017-08-26 00:38:00.511625 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050344 782446 bytes
> > >     -21> 2017-08-26 00:38:00.513813 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050345
> > >     -20> 2017-08-26 00:38:00.513820 7f14556b4d00 20 osd.299 0 get_map
> > > 1050345 -
> > > loading and decoding 0x7f14b3dfba80
> > >     -19> 2017-08-26 00:38:00.514260 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050345 782071 bytes
> > >     -18> 2017-08-26 00:38:00.516463 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050346
> > >     -17> 2017-08-26 00:38:00.516488 7f14556b4d00 20 osd.299 0 get_map
> > > 1050346 -
> > > loading and decoding 0x7f14b79c4000
> > >     -16> 2017-08-26 00:38:00.516927 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050346 781955 bytes
> > >     -15> 2017-08-26 00:38:00.519047 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050347
> > >     -14> 2017-08-26 00:38:00.519054 7f14556b4d00 20 osd.299 0 get_map
> > > 1050347 -
> > > loading and decoding 0x7f14b79c4340
> > >     -13> 2017-08-26 00:38:00.519500 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050347 781930 bytes
> > >     -12> 2017-08-26 00:38:00.521612 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050348
> > >     -11> 2017-08-26 00:38:00.521619 7f14556b4d00 20 osd.299 0 get_map
> > > 1050348 -
> > > loading and decoding 0x7f14b79c4680
> > >     -10> 2017-08-26 00:38:00.522074 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050348 784883 bytes
> > >      -9> 2017-08-26 00:38:00.524245 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050349
> > >      -8> 2017-08-26 00:38:00.524252 7f14556b4d00 20 osd.299 0 get_map
> > > 1050349 -
> > > loading and decoding 0x7f14b79c49c0
> > >      -7> 2017-08-26 00:38:00.524706 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050349 785081 bytes
> > >      -6> 2017-08-26 00:38:00.526854 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050350
> > >      -5> 2017-08-26 00:38:00.526861 7f14556b4d00 20 osd.299 0 get_map
> > > 1050350 -
> > > loading and decoding 0x7f14b79c4d00
> > >      -4> 2017-08-26 00:38:00.527330 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050350 785948 bytes
> > >      -3> 2017-08-26 00:38:00.529505 7f14556b4d00 10 osd.299 1085665
> > > build_past_intervals_parallel epoch 1050351
> > >      -2> 2017-08-26 00:38:00.529512 7f14556b4d00 20 osd.299 0 get_map
> > > 1050351 -
> > > loading and decoding 0x7f14b79c5040
> > >      -1> 2017-08-26 00:38:00.529979 7f14556b4d00 10 osd.299 0 add_map_bl
> > > 1050351 788650 bytes
> > >       0> 2017-08-26 00:38:00.534373 7f14556b4d00 -1
> > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc:
> > > In function 'virtual void pi_compact_rep::add_interval(bool, const
> > > PastIntervals::pg_interval_t&)' thread 7f14556b4d00 time 2017-08-26
> > > 00:38:00.532119
> > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/osd_types.cc:
> > > 3205: FAILED assert(interval.last > last)
> > > 
> > >   ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous
> > > (rc)
> > >   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x110) [0x7f145612f420]
> > >   2: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
> > > const&)+0x3b2) [0x7f1455e030b2]
> > >   3: (PastIntervals::check_new_interval(int, int, std::vector<int,
> > > std::allocator<int> > const&, std::vector<int, std::allocator<int> >
> > > const&,
> > > int, int, std::vector<int, std::allocator<int> > const&, std::vector<int,
> > > std::allocator<int> > const&, unsigned int, unsigned int,
> > > std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
> > > IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
> > > [0x7f1455de8ab0]
> > >   4: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
> > >   5: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
> > >   6: (OSD::init()+0x2179) [0x7f1455bd7779]
> > >   7: (main()+0x2def) [0x7f1455add56f]
> > >   8: (__libc_start_main()+0xf5) [0x7f1451d14b35]
> > >   9: (()+0x4ac8a6) [0x7f1455b7b8a6]
> > >   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> > > to
> > > interpret this.
> > > 
> > > --- logging levels ---
> > >     0/ 5 none
> > >     0/ 1 lockdep
> > >     0/ 1 context
> > >     1/ 1 crush
> > >     1/ 5 mds
> > >     1/ 5 mds_balancer
> > >     1/ 5 mds_locker
> > >     1/ 5 mds_log
> > >     1/ 5 mds_log_expire
> > >     1/ 5 mds_migrator
> > >     0/ 1 buffer
> > >     0/ 1 timer
> > >     0/ 1 filer
> > >     0/ 1 striper
> > >     0/ 1 objecter
> > >     0/ 5 rados
> > >     0/ 5 rbd
> > >     0/ 5 rbd_mirror
> > >     0/ 5 rbd_replay
> > >     0/ 5 journaler
> > >     0/ 5 objectcacher
> > >     0/ 5 client
> > >    20/20 osd
> > >     0/ 5 optracker
> > >     0/ 5 objclass
> > >     1/ 3 filestore
> > >     1/ 3 journal
> > >     0/ 5 ms
> > >     1/ 5 mon
> > >     0/10 monc
> > >     1/ 5 paxos
> > >     0/ 5 tp
> > >     1/ 5 auth
> > >     1/ 5 crypto
> > >     1/ 1 finisher
> > >     1/ 5 heartbeatmap
> > >     1/ 5 perfcounter
> > >     1/ 5 rgw
> > >     1/10 civetweb
> > >     1/ 5 javaclient
> > >     1/ 5 asok
> > >     1/ 1 throttle
> > >     0/ 0 refs
> > >     1/ 5 xio
> > >     1/ 5 compressor
> > >     1/ 5 bluestore
> > >     1/ 5 bluefs
> > >     1/ 3 bdev
> > >     1/ 5 kstore
> > >     4/ 5 rocksdb
> > >     4/ 5 leveldb
> > >     4/ 5 memdb
> > >     1/ 5 kinetic
> > >     1/ 5 fuse
> > >     1/ 5 mgr
> > >     1/ 5 mgrc
> > >     1/ 5 dpdk
> > >     1/ 5 eventtrace
> > >    -2/-2 (syslog threshold)
> > >    -1/-1 (stderr threshold)
> > >    max_recent     10000
> > >    max_new         1000
> > >    log_file /var/log/ceph/ceph-osd.299.log
> > > --- end dump of recent events ---
> > > 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal (Aborted) **
> > >   in thread 7f14556b4d00 thread_name:ceph-osd
> > > 
> > >   ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous
> > > (rc)
> > >   1: (()+0xa21a01) [0x7f14560f0a01]
> > >   2: (()+0xf370) [0x7f1452cfe370]
> > >   3: (gsignal()+0x37) [0x7f1451d281d7]
> > >   4: (abort()+0x148) [0x7f1451d298c8]
> > >   5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x284) [0x7f145612f594]
> > >   6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
> > > const&)+0x3b2) [0x7f1455e030b2]
> > >   7: (PastIntervals::check_new_interval(int, int, std::vector<int,
> > > std::allocator<int> > const&, std::vector<int, std::allocator<int> >
> > > const&,
> > > int, int, std::vector<int, std::allocator<int> > const&, std::vector<int,
> > > std::allocator<int> > const&, unsigned int, unsigned int,
> > > std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
> > > IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
> > > [0x7f1455de8ab0]
> > >   8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
> > >   9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
> > >   10: (OSD::init()+0x2179) [0x7f1455bd7779]
> > >   11: (main()+0x2def) [0x7f1455add56f]
> > >   12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
> > >   13: (()+0x4ac8a6) [0x7f1455b7b8a6]
> > >   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> > > to
> > > interpret this.
> > > 
> > > --- begin dump of recent events ---
> > >       0> 2017-08-26 00:38:00.572479 7f14556b4d00 -1 *** Caught signal
> > > (Aborted)
> > > **
> > >   in thread 7f14556b4d00 thread_name:ceph-osd
> > > 
> > >   ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous
> > > (rc)
> > >   1: (()+0xa21a01) [0x7f14560f0a01]
> > >   2: (()+0xf370) [0x7f1452cfe370]
> > >   3: (gsignal()+0x37) [0x7f1451d281d7]
> > >   4: (abort()+0x148) [0x7f1451d298c8]
> > >   5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x284) [0x7f145612f594]
> > >   6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
> > > const&)+0x3b2) [0x7f1455e030b2]
> > >   7: (PastIntervals::check_new_interval(int, int, std::vector<int,
> > > std::allocator<int> > const&, std::vector<int, std::allocator<int> >
> > > const&,
> > > int, int, std::vector<int, std::allocator<int> > const&, std::vector<int,
> > > std::allocator<int> > const&, unsigned int, unsigned int,
> > > std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
> > > IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
> > > [0x7f1455de8ab0]
> > >   8: (OSD::build_past_intervals_parallel()+0xa8f) [0x7f1455bbc71f]
> > >   9: (OSD::load_pgs()+0x503) [0x7f1455bbef13]
> > >   10: (OSD::init()+0x2179) [0x7f1455bd7779]
> > >   11: (main()+0x2def) [0x7f1455add56f]
> > >   12: (__libc_start_main()+0xf5) [0x7f1451d14b35]
> > >   13: (()+0x4ac8a6) [0x7f1455b7b8a6]
> > >   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> > > to
> > > interpret this.
> > > 
> > > --- logging levels ---
> > >     0/ 5 none
> > >     0/ 1 lockdep
> > >     0/ 1 context
> > >     1/ 1 crush
> > >     1/ 5 mds
> > >     1/ 5 mds_balancer
> > >     1/ 5 mds_locker
> > >     1/ 5 mds_log
> > >     1/ 5 mds_log_expire
> > >     1/ 5 mds_migrator
> > >     0/ 1 buffer
> > >     0/ 1 timer
> > >     0/ 1 filer
> > >     0/ 1 striper
> > >     0/ 1 objecter
> > >     0/ 5 rados
> > >     0/ 5 rbd
> > >     0/ 5 rbd_mirror
> > >     0/ 5 rbd_replay
> > >     0/ 5 journaler
> > >     0/ 5 objectcacher
> > >     0/ 5 client
> > >    20/20 osd
> > >     0/ 5 optracker
> > >     0/ 5 objclass
> > >     1/ 3 filestore
> > >     1/ 3 journal
> > >     0/ 5 ms
> > >     1/ 5 mon
> > >     0/10 monc
> > >     1/ 5 paxos
> > >     0/ 5 tp
> > >     1/ 5 auth
> > >     1/ 5 crypto
> > >     1/ 1 finisher
> > >     1/ 5 heartbeatmap
> > >     1/ 5 perfcounter
> > >     1/ 5 rgw
> > >     1/10 civetweb
> > >     1/ 5 javaclient
> > >     1/ 5 asok
> > >     1/ 1 throttle
> > >     0/ 0 refs
> > >     1/ 5 xio
> > >     1/ 5 compressor
> > >     1/ 5 bluestore
> > >     1/ 5 bluefs
> > >     1/ 3 bdev
> > >     1/ 5 kstore
> > >     4/ 5 rocksdb
> > >     4/ 5 leveldb
> > >     4/ 5 memdb
> > >     1/ 5 kinetic
> > >     1/ 5 fuse
> > >     1/ 5 mgr
> > >     1/ 5 mgrc
> > >     1/ 5 dpdk
> > >     1/ 5 eventtrace
> > >    -2/-2 (syslog threshold)
> > >    -1/-1 (stderr threshold)
> > >    max_recent     10000
> > >    max_new         1000
> > >    log_file /var/log/ceph/ceph-osd.299.log
> > > --- end dump of recent events ---
> > > 
> > > 
> > > ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299/ --op info
> > > --pgid
> > > 19.f1e
> > > {
> > >      "pgid": "19.f1e",
> > >      "last_update": "0'0",
> > >      "last_complete": "0'0",
> > >      "log_tail": "0'0",
> > >      "last_user_version": 0,
> > >      "last_backfill": "MAX",
> > >      "last_backfill_bitwise": 0,
> > >      "purged_snaps": [],
> > >      "history": {
> > >          "epoch_created": 1084817,
> > >          "epoch_pool_created": 1084817,
> > >          "last_epoch_started": 1085232,
> > >          "last_interval_started": 1085230,
> > >          "last_epoch_clean": 1050342,
> > >          "last_interval_clean": 1050342,
> > >          "last_epoch_split": 0,
> > >          "last_epoch_marked_full": 1061015,
> > >          "same_up_since": 1085230,
> > >          "same_interval_since": 1085230,
> > >          "same_primary_since": 1085230,
> > >          "last_scrub": "960114'865853",
> > >          "last_scrub_stamp": "2017-08-25 17:32:06.181006",
> > >          "last_deep_scrub": "952725'861179",
> > >          "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
> > >          "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006"
> > >      },
> > >      "stats": {
> > >          "version": "0'0",
> > >          "reported_seq": "424",
> > >          "reported_epoch": "1085650",
> > >          "state": "active+undersized+degraded",
> > >          "last_fresh": "2017-08-25 18:52:46.520078",
> > >          "last_change": "2017-08-25 18:38:16.356266",
> > >          "last_active": "2017-08-25 18:52:46.520078",
> > >          "last_peered": "2017-08-25 18:52:46.520078",
> > >          "last_clean": "2017-08-25 17:32:06.181006",
> > >          "last_became_active": "2017-08-25 18:38:16.356266",
> > >          "last_became_peered": "2017-08-25 18:38:16.356266",
> > >          "last_unstale": "2017-08-25 18:52:46.520078",
> > >          "last_undegraded": "2017-08-25 18:38:16.304877",
> > >          "last_fullsized": "2017-08-25 18:38:16.304877",
> > >          "mapping_epoch": 1085230,
> > >          "log_start": "0'0",
> > >          "ondisk_log_start": "0'0",
> > >          "created": 1084817,
> > >          "last_epoch_clean": 1050342,
> > >          "parent": "0.0",
> > >          "parent_split_bits": 0,
> > >          "last_scrub": "960114'865853",
> > >          "last_scrub_stamp": "2017-08-25 17:32:06.181006",
> > >          "last_deep_scrub": "952725'861179",
> > >          "last_deep_scrub_stamp": "2017-08-25 17:32:06.181006",
> > >          "last_clean_scrub_stamp": "2017-08-25 17:32:06.181006",
> > >          "log_size": 0,
> > >          "ondisk_log_size": 0,
> > >          "stats_invalid": false,
> > >          "dirty_stats_invalid": false,
> > >          "omap_stats_invalid": false,
> > >          "hitset_stats_invalid": false,
> > >          "hitset_bytes_stats_invalid": false,
> > >          "pin_stats_invalid": false,
> > >          "stat_sum": {
> > >              "num_bytes": 0,
> > >              "num_objects": 0,
> > >              "num_object_clones": 0,
> > >              "num_object_copies": 0,
> > >              "num_objects_missing_on_primary": 0,
> > >              "num_objects_missing": 0,
> > >              "num_objects_degraded": 0,
> > >              "num_objects_misplaced": 0,
> > >              "num_objects_unfound": 0,
> > >              "num_objects_dirty": 0,
> > >              "num_whiteouts": 0,
> > >              "num_read": 0,
> > >              "num_read_kb": 0,
> > >              "num_write": 0,
> > >              "num_write_kb": 0,
> > >              "num_scrub_errors": 0,
> > >              "num_shallow_scrub_errors": 0,
> > >              "num_deep_scrub_errors": 0,
> > >              "num_objects_recovered": 0,
> > >              "num_bytes_recovered": 0,
> > >              "num_keys_recovered": 0,
> > >              "num_objects_omap": 0,
> > >              "num_objects_hit_set_archive": 0,
> > >              "num_bytes_hit_set_archive": 0,
> > >              "num_flush": 0,
> > >              "num_flush_kb": 0,
> > >              "num_evict": 0,
> > >              "num_evict_kb": 0,
> > >              "num_promote": 0,
> > >              "num_flush_mode_high": 0,
> > >              "num_flush_mode_low": 0,
> > >              "num_evict_mode_some": 0,
> > >              "num_evict_mode_full": 0,
> > >              "num_objects_pinned": 0,
> > >              "num_legacy_snapsets": 0
> > >          },
> > >          "up": [
> > >              299
> > >          ],
> > >          "acting": [
> > >              299
> > >          ],
> > >          "blocked_by": [],
> > >          "up_primary": 299,
> > >          "acting_primary": 299
> > >      },
> > >      "empty": 1,
> > >      "dne": 0,
> > >      "incomplete": 0,
> > >      "last_epoch_started": 1085232,
> > >      "hit_set_history": {
> > >          "current_last_update": "0'0",
> > >          "history": []
> > >      }
> > > }
> > > 
> > > 
> > > ll /var/lib/ceph/osd/ceph-299/current/19.f1e_head/
> > > total 0
> > > -rw-r--r-- 1 root root 0 Aug 25 18:38 __head_00000F1E__13
> > > 
> > > ceph pg 19.f1e query
> > > .
> > > .
> > >              "blocked": "peering is blocked due to down osds",
> > >              "down_osds_we_would_probe": [
> > >                  299
> > >              ],
> > >              "peering_blocked_by": [
> > >                  {
> > >                      "osd": 299,
> > >                      "current_lost_at": 0,
> > >                      "comment": "starting or marking this osd lost may let
> > > us
> > > proceed"
> > >                  }
> > >              ]
> > > .
> > > .
> > > .
> > > 
> > > 
> > > 
> > > removing the pg with ceph-objectstore-tool did the trick. but i am not
> > > sure if
> > > that will happen to a pg with real data in it or not.
> > > should i report this in the bug tracker?
> > > 
> > > thanks
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-25 23:08                       ` Sage Weil
@ 2017-08-26 12:13                         ` Linux Chips
  2017-08-26 21:17                           ` Linux Chips
  0 siblings, 1 reply; 27+ messages in thread
From: Linux Chips @ 2017-08-26 12:13 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mustafa Muhammad, ceph-devel

Hi, I reported the bug as #21142.
We are also seeing another bug that has killed some OSDs during peering; I'll
collect more information and file a separate report.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-26 12:13                         ` Linux Chips
@ 2017-08-26 21:17                           ` Linux Chips
  2017-08-27  2:00                             ` Sage Weil
  0 siblings, 1 reply; 27+ messages in thread
From: Linux Chips @ 2017-08-26 21:17 UTC (permalink / raw)
  To: Sage Weil, ceph-devel; +Cc: Mustafa Muhammad

Hi again,
now almost everything is sorted out. We had a few inconsistent shards that
were killing the OSDs during recovery; we fixed some of them by removing
the bad shards, and others by starting other OSDs that held good shards.
What is stopping us now is that one OSD has a corrupted leveldb and
refuses to start.
I am not sure how that happened, but I assume it is due to the many times
the node/OSD died from lack of memory.
I am also not sure whether we should continue the discussion here or start
a new thread.

the osd (262) is showing those logs upon start:

2017-08-26 17:07:17.915861 7fbd8e4cbd00  0 set uid:gid to 0:0 (:)
2017-08-26 17:07:17.915875 7fbd8e4cbd00  0 ceph version 12.1.4 
(a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process 
(unknown), pid 26713
2017-08-26 17:07:17.927085 7fbd8e4cbd00  0 pidfile_write: ignore empty 
--pid-file
2017-08-26 17:07:17.951358 7fbd8e4cbd00  0 load: jerasure load: lrc 
load: isa
2017-08-26 17:07:17.951602 7fbd8e4cbd00  0 
filestore(/var/lib/ceph/osd/ceph-262) backend xfs (magic 0x58465342)
2017-08-26 17:07:17.952164 7fbd8e4cbd00  0 
filestore(/var/lib/ceph/osd/ceph-262) backend xfs (magic 0x58465342)
2017-08-26 17:07:17.952977 7fbd8e4cbd00  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-08-26 17:07:17.952983 7fbd8e4cbd00  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: 
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-08-26 17:07:17.952985 7fbd8e4cbd00  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: 
splice() is disabled via 'filestore splice' config option
2017-08-26 17:07:17.953309 7fbd8e4cbd00  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: 
syncfs(2) syscall fully supported (by glibc and kernel)
2017-08-26 17:07:17.953797 7fbd8e4cbd00  0 
xfsfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_feature: extsize 
is disabled by conf
2017-08-26 17:07:17.954628 7fbd8e4cbd00  0 
filestore(/var/lib/ceph/osd/ceph-262) start omap initiation
2017-08-26 17:07:17.957166 7fbd8e4cbd00 -1 
filestore(/var/lib/ceph/osd/ceph-262) mount(1724): Error initializing 
leveldb : Corruption: error in middle of record

2017-08-26 17:07:17.957179 7fbd8e4cbd00 -1 osd.262 0 OSD:init: unable to 
mount object store
2017-08-26 17:07:17.957183 7fbd8e4cbd00 -1  ** ERROR: osd init failed: 
(1) Operation not permitted

ceph-objectstore-tool shows similar errors.
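
For reference, the invocations that hit the same leveldb corruption are of
this form; the pgid and output path below are only examples, not the exact
commands we ran:

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-262/ --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-262/ --pgid 19.608 \
      --op export --file /root/19.608.export

Both have to open the filestore omap first, so they fail with the same
"Error initializing leveldb : Corruption" message as the OSD itself.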

So we figured it is only one OSD and we can go on without it. We marked it
lost, and the PGs started to peer and went active, but 5 remain in the
incomplete state, and the pg query shows:

...
     "recovery_state": [
         {
             "name": "Started/Primary/Peering/Incomplete",
             "enter_time": "2017-08-26 22:59:03.044623",
             "comment": "not enough complete instances of this PG"
         },
         {
             "name": "Started/Primary/Peering",
             "enter_time": "2017-08-26 22:59:02.540748",
             "past_intervals": [
                 {
                     "first": "959669",
                     "last": "1090812",
                     "all_participants": [
                         {
                             "osd": 258
                         },
                         {
                             "osd": 262
                         },
                         {
                             "osd": 338
                         },
                         {
                             "osd": 545
                         },
                         {
                             "osd": 549
                         }
                     ],
                     "intervals": [
                         {
                             "first": "964880",
                             "last": "964924",
                             "acting": "262"
                         },
                         {
                             "first": "978855",
                             "last": "978956",
                             "acting": "545"
                         },
                         {
                             "first": "989628",
                             "last": "989808",
                             "acting": "258"
                         },
                         {
                             "first": "992614",
                             "last": "992975",
                             "acting": "549"
                         },
                         {
                             "first": "1085148",
                             "last": "1090812",
                             "acting": "338"
                         }
                     ]
                 }
             ],
             "probing_osds": [
                 "258",
                 "338",
                 "545",
                 "549"
             ],
             "down_osds_we_would_probe": [
                 262
             ],
             "peering_blocked_by": [],
             "peering_blocked_by_detail": [
                 {
                     "detail": "peering_blocked_by_history_les_bound"
                 }
             ]
         },
...

I am not sure what the "peering_blocked_by_history_les_bound" detail means, or
how to proceed; googling it turned up nothing useful.
All of the incomplete PGs show the same detail as above and a similar
recovery state.

ceph pg ls | grep incomplete
18.54b     0  0  0  0  0            0  2739  2739  incomplete  2017-08-26 23:15:46.705071     46889'4277    1091150:314001  [332,253]  332  [332,253]  332     46889'4277  2017-08-04 03:15:58.381025     46889'4277  2017-07-29 06:47:30.337673
19.54a  5950  0  0  0  0  26108435266  3019  3019  incomplete  2017-08-26 23:15:46.705156  961411'873129  1091150:58116482  [332,253]  332  [332,253]  332  960118'872495  2017-08-04 03:12:33.647414  952850'868978  2017-07-02 15:53:08.565948
19.608     0  0  0  0  0            0     0     0  incomplete  2017-08-26 22:59:03.044649            0'0       1091150:428  [258,338]  258  [258,338]  258  960118'862299  2017-08-04 03:01:57.011411  958900'861456  2017-07-28 02:33:29.476119
19.8bb     0  0  0  0  0            0     0     0  incomplete  2017-08-26 22:59:02.946453            0'0       1091150:339  [260,331]  260  [260,331]  260  960114'866811  2017-08-03 04:51:42.117840  952850'864443  2017-07-08 02:48:37.958357
19.dd3  5864  0  0  0  0  25600089555  3094  3094  incomplete  2017-08-26 17:20:07.948285  961411'865657  1091150:72381143  [263,142]  263  [263,142]  263  960118'865078  2017-08-25 17:32:06.181006  960118'865078  2017-08-25 17:32:06.181006


I also noticed that some of those report 0 objects even though the PG
directory on one of the OSDs does contain objects.
These pools are replica 2.


thanks
ali

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-26 21:17                           ` Linux Chips
@ 2017-08-27  2:00                             ` Sage Weil
  2017-08-29  7:44                               ` Mustafa Muhammad
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2017-08-27  2:00 UTC (permalink / raw)
  To: Linux Chips; +Cc: ceph-devel, Mustafa Muhammad

On Sun, 27 Aug 2017, Linux Chips wrote:
> Hi again,
> now every thing almost sorted out. we had a few inconsistent shards that were
> killing the OSDs when recovering, we fixed some of them by removing the bad
> shards, and some by starting other OSDs with good shards.
> what is stopping us now, is that one OSD had a corrupted leveldb and refuses
> to start.
> not sure how that hapened, but i asume is due to the many times the node/osd
> died from lack of memory.
> I am also not sure if we should continue the discussion here, or start a new
> thread.
> 
> the osd (262) is showing those logs upon start:
> 
> 2017-08-26 17:07:17.915861 7fbd8e4cbd00  0 set uid:gid to 0:0 (:)
> 2017-08-26 17:07:17.915875 7fbd8e4cbd00  0 ceph version 12.1.4
> (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process (unknown),
> pid 26713
> 2017-08-26 17:07:17.927085 7fbd8e4cbd00  0 pidfile_write: ignore empty
> --pid-file
> 2017-08-26 17:07:17.951358 7fbd8e4cbd00  0 load: jerasure load: lrc load: isa
> 2017-08-26 17:07:17.951602 7fbd8e4cbd00  0
> filestore(/var/lib/ceph/osd/ceph-262) backend xfs (magic 0x58465342)
> 2017-08-26 17:07:17.952164 7fbd8e4cbd00  0
> filestore(/var/lib/ceph/osd/ceph-262) backend xfs (magic 0x58465342)
> 2017-08-26 17:07:17.952977 7fbd8e4cbd00  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: FIEMAP
> ioctl is disabled via 'filestore fiemap' config option
> 2017-08-26 17:07:17.952983 7fbd8e4cbd00  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features:
> SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
> 2017-08-26 17:07:17.952985 7fbd8e4cbd00  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: splice()
> is disabled via 'filestore splice' config option
> 2017-08-26 17:07:17.953309 7fbd8e4cbd00  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: syncfs(2)
> syscall fully supported (by glibc and kernel)
> 2017-08-26 17:07:17.953797 7fbd8e4cbd00  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_feature: extsize is
> disabled by conf
> 2017-08-26 17:07:17.954628 7fbd8e4cbd00  0
> filestore(/var/lib/ceph/osd/ceph-262) start omap initiation
> 2017-08-26 17:07:17.957166 7fbd8e4cbd00 -1
> filestore(/var/lib/ceph/osd/ceph-262) mount(1724): Error initializing leveldb
> : Corruption: error in middle of record
> 
> 2017-08-26 17:07:17.957179 7fbd8e4cbd00 -1 osd.262 0 OSD:init: unable to mount
> object store
> 2017-08-26 17:07:17.957183 7fbd8e4cbd00 -1  ** ERROR: osd init failed: (1)
> Operation not permitted
> 
> ceph-objectstore-tool shows similar errors.
> 
> so, we figured it is only one OSD and we can go without it. we marked it lost,
> pgs started to peer and got active. but 5 remain in the incomplete state. and
> te pg query shows:
> 
> ...
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Peering/Incomplete",
>             "enter_time": "2017-08-26 22:59:03.044623",
>             "comment": "not enough complete instances of this PG"
>         },
>         {
>             "name": "Started/Primary/Peering",
>             "enter_time": "2017-08-26 22:59:02.540748",
>             "past_intervals": [
>                 {
>                     "first": "959669",
>                     "last": "1090812",
>                     "all_participants": [
>                         {
>                             "osd": 258
>                         },
>                         {
>                             "osd": 262
>                         },
>                         {
>                             "osd": 338
>                         },
>                         {
>                             "osd": 545
>                         },
>                         {
>                             "osd": 549
>                         }
>                     ],
>                     "intervals": [
>                         {
>                             "first": "964880",
>                             "last": "964924",
>                             "acting": "262"
>                         },
>                         {
>                             "first": "978855",
>                             "last": "978956",
>                             "acting": "545"
>                         },
>                         {
>                             "first": "989628",
>                             "last": "989808",
>                             "acting": "258"
>                         },
>                         {
>                             "first": "992614",
>                             "last": "992975",
>                             "acting": "549"
>                         },
>                         {
>                             "first": "1085148",
>                             "last": "1090812",
>                             "acting": "338"
>                         }
>                     ]
>                 }
>             ],
>             "probing_osds": [
>                 "258",
>                 "338",
>                 "545",
>                 "549"
>             ],
>             "down_osds_we_would_probe": [
>                 262
>             ],
>             "peering_blocked_by": [],
>             "peering_blocked_by_detail": [
>                 {
>                     "detail": "peering_blocked_by_history_les_bound"
>                 }
>             ]
>         },
> ...
> 
> not sure wat that detail "peering_blocked_by_history_les_bound" is, and not
> sure how to proceed. i googled it, came up with nothing useful.
> all the incomplete pgs have the same detail as the above and similar recovery
> state.

It means that the pg metadata suggests that the PG may have gone active 
elsewhere, but we don't actually have any evidence that there were newer 
updates.  Since that OSD won't start and you can't extract the needed PGs 
from it with ceph-objectstore-tool export (or maybe you can get it from 
elsewhere?) there isn't much to lose by bypassing the check.  The config 
option has to be set to true on the primary OSD for the PG and peering 
retriggered (e.g., by marking the primary down with 'ceph osd down NN').
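
For illustration only (assuming the option in question is
osd_find_best_info_ignore_history_les, which is not named above, and using
osd.258 as a stand-in for the actual primary of the PG being tested), the
sequence could look roughly like:

  ceph tell osd.258 injectargs '--osd-find-best-info-ignore-history-les=true'
  ceph osd down 258        # kick the primary so the PG re-peers
  # once the PG goes active, turn the override back off:
  ceph tell osd.258 injectargs '--osd-find-best-info-ignore-history-les=false'

If injectargs does not take effect for that option, setting it in ceph.conf
on that OSD and restarting it should have the same effect.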

I'd test it on the 0 object PGs first :)

sage


> > ceph pg ls | grep incomplete
> 18.54b         0                  0        0         0       0  0
> 2739                 2739                              incomplete 2017-08-26
> 23:15:46.705071   46889'4277      1091150:314001                  [332,253]
> 332                                         [332,253]            332
> 46889'4277 2017-08-04 03:15:58.381025        46889'4277 2017-07-29
> 06:47:30.337673 
> 
> 19.54a      5950                  0        0         0       0 26108435266
> 3019                 3019                                      incomplete
> 2017-08-26 23:15:46.705156     961411'873129    1091150:58116482
> [332,253]        332
> [332,253]          332     960118'872495 2017-08-04 03:12:33.647414
> 952850'868978 2017-07-02 15:53:08.565948
> 19.608         0                  0        0         0       0  0
> 0                    0                              incomplete 2017-08-26
> 22:59:03.044649          0'0         1091150:428                  [258,338]
> 258                                         [258,338]            258
> 960118'862299 2017-08-04 03:01:57.011411     958900'861456 2017-07-28
> 02:33:29.476119
> 19.8bb         0                  0        0         0       0  0
> 0                    0                              incomplete 2017-08-26
> 22:59:02.946453          0'0         1091150:339                  [260,331]
> 260                                         [260,331]            260
> 960114'866811 2017-08-03 04:51:42.117840     952850'864443 2017-07-08
> 02:48:37.958357
> 19.dd3      5864                  0        0         0       0 25600089555
> 3094                 3094                                      incomplete
> 2017-08-26 17:20:07.948285     961411'865657    1091150:72381143
> [263,142]        263
> [263,142]          263     960118'865078 2017-08-25 17:32:06.181006
> 960118'865078 2017-08-25 17:32:06.181006
> 
> 
> I also noticed that some of those have 0 objects in them despite the dir in
> one of the osds have objects in it.
> these pools are replica 2
> 
> 
> thanks
> ali
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-27  2:00                             ` Sage Weil
@ 2017-08-29  7:44                               ` Mustafa Muhammad
  2017-08-29 19:34                                 ` Mustafa Muhammad
  0 siblings, 1 reply; 27+ messages in thread
From: Mustafa Muhammad @ 2017-08-29  7:44 UTC (permalink / raw)
  To: Sage Weil; +Cc: Linux Chips, ceph-devel

Hi all,
Not sure if I should open a new thread, but this is the same cluster,
so this should provide a little background.
Now the cluster is up and recovering, but we are hitting a bug that is
crashing the OSD

     0> 2017-08-29 10:00:51.699557 7fae66139700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
In function 'int ECUtil::decode(const ECUtil::stripe_info_t&,
ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&,
std::map<int, ceph::buffer::list*>&)' thread 7fae66139700 time
2017-08-29 10:00:51.688625
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
59: FAILED assert(i->second.length() == total_data_size)

Probably http://tracker.ceph.com/issues/14009

Some shards are problematic: some have smaller sizes (definitely a problem),
and in others the last part is all zeros (not sure whether that is padding or
a problem).
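
For what it's worth, we spot the suspect shards by comparing the size of the
shard file on disk across the acting set; roughly like this, run on each host
that carries a shard of the affected PG (the PG id and object-name pattern
are illustrative):

  find /var/lib/ceph/osd/ceph-*/current/143.1b0s*_head \
       -name '*2033460653*' -printf '%s %p\n'

A shard whose size disagrees with the rest is the one the assert complains
about.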

Now we have set noup, marked OSDs with corrupt chunks down, and let
the recovery proceed, but this is happening in lots of PGs and is very
slow.
Is there anything we can do to fix this faster? We tried removing the
corrupted chunk and got this crash (I grepped for the thread in which the
abort happened):

   -77> 2017-08-28 15:11:40.030178 7f90cd519700  0 osd.377 pg_epoch:
1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
[377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
r=0 lpr=1102586 pi=[960339,1102586)/44 rops=1
bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
active+remapped+backfilling] failed_push
143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
from shard 548(8), reps on  unfound? 0
    -2> 2017-08-28 15:11:40.130722 7f90cd519700 -1 osd.377 pg_epoch:
1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
[377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
r=0 lpr=1102586 pi=[960339,1102586)/44
bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
active+remapped+backfilling] recover_replicas: object
143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
last_backfill 143:0d9ce1c5:::default.63296332.1__shadow_26882237.2~mGGm_A45xKldAdADFC13qizbUiC0Yrw.1_158:head
    -1> 2017-08-28 15:11:40.130802 7f90cd519700 -1 osd.377 pg_epoch:
1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
[377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
r=0 lpr=1102586 pi=[960339,1102586)/44
bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
active+remapped+backfilling] recover_replicas: object added to missing
set for backfill, but is not in recovering, error!
     0> 2017-08-28 15:11:40.134768 7f90cd519700 -1 *** Caught signal
(Aborted) **
in thread 7f90cd519700 thread_name:tp_osd_tp

What can we do to fix this?
Will enabling fast_read on the pool benefit us, or is it client-only?
Any ideas?
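
If fast_read does turn out to help, enabling it is presumably just the pool
flag (the pool name is a placeholder):

  ceph osd pool set <ec-pool-name> fast_read 1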

Regards
Mustafa

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-29  7:44                               ` Mustafa Muhammad
@ 2017-08-29 19:34                                 ` Mustafa Muhammad
  2017-08-29 19:49                                   ` Linux Chips
  0 siblings, 1 reply; 27+ messages in thread
From: Mustafa Muhammad @ 2017-08-29 19:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Linux Chips, ceph-devel

I reported this issue, if you can take a look:

http://tracker.ceph.com/issues/21173

Regards
Mustafa

On Tue, Aug 29, 2017 at 10:44 AM, Mustafa Muhammad
<mustafa1024m@gmail.com> wrote:
> Hi all,
> Not sure if I should open a new thread, but this is the same cluster,
> so this should provide a little background.
> Now the cluster is up and recovering, but we are hitting a bug that is
> crashing the OSD
>
>      0> 2017-08-29 10:00:51.699557 7fae66139700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
> In function 'int ECUtil::decode(const ECUtil::stripe_info_t&,
> ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&,
> std::map<int, ceph::buffer::list*>&)' thread 7fae66139700 time
> 2017-08-29 10:00:51.688625
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
> 59: FAILED assert(i->second.length() == total_data_size)
>
> Probably http://tracker.ceph.com/issues/14009
>
> Some shards are problematic, smaller sizes (definitely a problem) or
> last part of them is all zeros (not sure if this is padding or
> problem).
>
> Now we have set noup, marked OSDs with corrupt chunks down, and let
> the recovery proceed, but this is happening in lots of PGs and is very
> slow.
> Is there anything we can do to fix this faster, we tried removing the
> corrupted chunk? and got this crash (I grep the thread in which Abort
> happened):
>
>    -77> 2017-08-28 15:11:40.030178 7f90cd519700  0 osd.377 pg_epoch:
> 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
> local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
> 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
> [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
> r=0 lpr=1102586 pi=[960339,1102586)/44 rops=1
> bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
> active+remapped+backfilling] failed_push
> 143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
> from shard 548(8), reps on  unfound? 0
>     -2> 2017-08-28 15:11:40.130722 7f90cd519700 -1 osd.377 pg_epoch:
> 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
> local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
> 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
> [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
> r=0 lpr=1102586 pi=[960339,1102586)/44
> bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
> active+remapped+backfilling] recover_replicas: object
> 143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
> last_backfill 143:0d9ce1c5:::default.63296332.1__shadow_26882237.2~mGGm_A45xKldAdADFC13qizbUiC0Yrw.1_158:head
>     -1> 2017-08-28 15:11:40.130802 7f90cd519700 -1 osd.377 pg_epoch:
> 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
> local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
> 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
> [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
> r=0 lpr=1102586 pi=[960339,1102586)/44
> bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
> active+remapped+backfilling] recover_replicas: object added to missing
> set for backfill, but is not in recovering, error!
>      0> 2017-08-28 15:11:40.134768 7f90cd519700 -1 *** Caught signal
> (Aborted) **
> in thread 7f90cd519700 thread_name:tp_osd_tp
>
> What we can do to fix this?
> Will enabling fast_read on the pool benefit us or it is client only?
> Any ideas?
>
> Regards
> Mustafa

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: High memory usage kills OSD while peering
  2017-08-29 19:34                                 ` Mustafa Muhammad
@ 2017-08-29 19:49                                   ` Linux Chips
  0 siblings, 0 replies; 27+ messages in thread
From: Linux Chips @ 2017-08-29 19:49 UTC (permalink / raw)
  To: Mustafa Muhammad, Sage Weil; +Cc: ceph-devel

Hi, this would not normally be an issue,
but I think the whole episode of the OOM killer and nodes dying made the OSDs
write a lot of bad files to disk, so we are seeing hundreds of such files so
far, and we are not sure how much is still left to fix.
We had to do "ceph osd set pause" to keep recovery moving; otherwise it
is a mess.
I am willing to patch this if anyone has a good idea of how to deal with it,
as I am not sure what the best approach is.

My idea (not sure how easy it is to implement) is: when we detect a size
mismatch, grab all the chunks, take enough shards with matching sizes, and
decode from those,
then probably mark the PG inconsistent and let repair deal with it
once the PG finishes recovering.
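
For the record, the manual per-shard version of this that we have been doing
by hand looks roughly like the following; the OSD id, PG id and object JSON
are placeholders, and the object JSON comes from the --op list output:

  systemctl stop ceph-osd@548
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-548 \
      --pgid 143.1b0s8 --op list | grep 2033460653
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-548 \
      --pgid 143.1b0s8 '<object-json>' get-bytes /root/shard.bak   # keep a copy
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-548 \
      --pgid 143.1b0s8 '<object-json>' remove
  systemctl start ceph-osd@548

Having the OSD rebuild the removed shard from the ones whose sizes agree is
essentially what the in-OSD fix would automate.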

On 08/29/2017 10:34 PM, Mustafa Muhammad wrote:
> I reported this issue, if you can take a look:
> 
> http://tracker.ceph.com/issues/21173
> 
> Regards
> Mustafa
> 
> On Tue, Aug 29, 2017 at 10:44 AM, Mustafa Muhammad
> <mustafa1024m@gmail.com> wrote:
>> Hi all,
>> Not sure if I should open a new thread, but this is the same cluster,
>> so this should provide a little background.
>> Now the cluster is up and recovering, but we are hitting a bug that is
>> crashing the OSD
>>
>>       0> 2017-08-29 10:00:51.699557 7fae66139700 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
>> In function 'int ECUtil::decode(const ECUtil::stripe_info_t&,
>> ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&,
>> std::map<int, ceph::buffer::list*>&)' thread 7fae66139700 time
>> 2017-08-29 10:00:51.688625
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
>> 59: FAILED assert(i->second.length() == total_data_size)
>>
>> Probably http://tracker.ceph.com/issues/14009
>>
>> Some shards are problematic, smaller sizes (definitely a problem) or
>> last part of them is all zeros (not sure if this is padding or
>> problem).
>>
>> Now we have set noup, marked OSDs with corrupt chunks down, and let
>> the recovery proceed, but this is happening in lots of PGs and is very
>> slow.
>> Is there anything we can do to fix this faster, we tried removing the
>> corrupted chunk? and got this crash (I grep the thread in which Abort
>> happened):
>>
>>     -77> 2017-08-28 15:11:40.030178 7f90cd519700  0 osd.377 pg_epoch:
>> 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
>> local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
>> 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
>> [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
>> r=0 lpr=1102586 pi=[960339,1102586)/44 rops=1
>> bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
>> active+remapped+backfilling] failed_push
>> 143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
>> from shard 548(8), reps on  unfound? 0
>>      -2> 2017-08-28 15:11:40.130722 7f90cd519700 -1 osd.377 pg_epoch:
>> 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
>> local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
>> 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
>> [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
>> r=0 lpr=1102586 pi=[960339,1102586)/44
>> bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
>> active+remapped+backfilling] recover_replicas: object
>> 143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
>> last_backfill 143:0d9ce1c5:::default.63296332.1__shadow_26882237.2~mGGm_A45xKldAdADFC13qizbUiC0Yrw.1_158:head
>>      -1> 2017-08-28 15:11:40.130802 7f90cd519700 -1 osd.377 pg_epoch:
>> 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
>> local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
>> 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
>> [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
>> r=0 lpr=1102586 pi=[960339,1102586)/44
>> bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
>> active+remapped+backfilling] recover_replicas: object added to missing
>> set for backfill, but is not in recovering, error!
>>       0> 2017-08-28 15:11:40.134768 7f90cd519700 -1 *** Caught signal
>> (Aborted) **
>> in thread 7f90cd519700 thread_name:tp_osd_tp
>>
>> What we can do to fix this?
>> Will enabling fast_read on the pool benefit us or it is client only?
>> Any ideas?
>>
>> Regards
>> Mustafa


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2017-08-29 19:49 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-17 14:13 High memory usage kills OSD while peering Linux Chips
2017-08-17 17:53 ` Gregory Farnum
2017-08-17 18:51   ` Linux Chips
2017-08-19 16:38     ` Mustafa Muhammad
2017-08-22 22:33       ` Sage Weil
2017-08-23 12:21         ` Linux Chips
2017-08-23 13:46           ` Sage Weil
2017-08-23 15:27             ` Linux Chips
2017-08-24  3:58               ` Sage Weil
2017-08-25 22:25                 ` Linux Chips
2017-08-25 22:46                   ` Sage Weil
2017-08-25 22:49                     ` Sage Weil
2017-08-25 23:03                       ` Linux Chips
2017-08-25 23:03                     ` Linux Chips
2017-08-25 23:08                       ` Sage Weil
2017-08-26 12:13                         ` Linux Chips
2017-08-26 21:17                           ` Linux Chips
2017-08-27  2:00                             ` Sage Weil
2017-08-29  7:44                               ` Mustafa Muhammad
2017-08-29 19:34                                 ` Mustafa Muhammad
2017-08-29 19:49                                   ` Linux Chips
     [not found]     ` <CAGtbiz1eHTiaO4pWu4sU97E8N+=DthTXjbY_Ga9CONW862y2XQ@mail.gmail.com>
2017-08-21 10:48       ` Linux Chips
2017-08-21 10:57     ` Linux Chips
2017-08-21 13:07       ` Haomai Wang
     [not found]         ` <93debf2d-12cb-eceb-e9cd-5226ad49cc16@gmail.com>
2017-08-21 15:18           ` Haomai Wang
2017-08-21 16:05             ` Mustafa Muhammad
2017-08-22  8:37             ` Linux Chips

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.