From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vladimir Bashkirtsev
Subject: Re: PGs stuck in creating state
Date: Tue, 22 May 2012 21:55:59 +0930
Message-ID: <4FBB85D7.1030203@bashkirtsev.com>
References: <4F90DCBC.4030104@bashkirtsev.com> <4FA71D55.6050902@bashkirtsev.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail.logics.net.au ([150.101.56.178]:36517 "EHLO mail.logics.net.au"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750852Ab2EVM0s
	(ORCPT ); Tue, 22 May 2012 08:26:48 -0400
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Sage Weil
Cc: ceph-devel@vger.kernel.org

On 08/05/12 01:26, Sage Weil wrote:
> On Mon, 7 May 2012, Vladimir Bashkirtsev wrote:
>> On 20/04/12 14:41, Sage Weil wrote:
>>> On Fri, 20 Apr 2012, Vladimir Bashkirtsev wrote:
>>>> Dear devs,
>>>>
>>>> First of all I would like to bow my head at your great effort! Even if
>>>> ceph has not reached prime-time status yet, it is already extremely
>>>> powerful and fairly stable, to the point that we have deployed it in a
>>>> live environment (still backing up, of course).
>>>>
>>>> I have played with ceph extensively, putting it through different failure
>>>> modes, and it recovers in most cases without major issues. However, when
>>>> I was adding OSDs I got 2 PGs stuck in the creating state, and no matter
>>>> what I have done (changing CRUSH, restarting OSDs to get ceph to move
>>>> data around) they are still there. So instead of using the scientific
>>>> blindfold method I decided to dig deeper. So here we go (please disregard
>>>> the stuck active+remapped PGs as they are part of another problem):
>>>>
>>>> [root@gamma tmp]# ceph -s
>>>> 2012-04-20 12:40:45.294969    pg v625626: 600 pgs: 2 creating, 586 active+clean, 12 active+remapped; 253 GB data, 767 GB used, 1292 GB / 2145 GB avail
>>>> 2012-04-20 12:40:45.299426   mds e457: 1/1/1 up {0=2=up:active}, 2 up:standby
>>>> 2012-04-20 12:40:45.299549   osd e6022: 6 osds: 6 up, 6 in
>>>> 2012-04-20 12:40:45.299856   log 2012-04-20 12:26:43.988716 mds.0 172.16.64.202:6802/21363 1 : [DBG] reconnect by client.13897 172.16.64.10:0/1319861516 after 0.101633
>>>> 2012-04-20 12:40:45.300160   mon e1: 3 mons at {0=172.16.64.200:6789/0,1=172.16.64.201:6789/0,2=172.16.64.202:6789/0}
>>>>
>>>> [root@gamma tmp]# ceph pg dump | grep creating
>>>> 2.1p3   0  0  0  0  0  0  0  creating  0.000000  0'0  0'0  []  []  0'0  0.000000
>>>> 1.1p3   0  0  0  0  0  0  0  creating  0.000000  0'0  0'0  []  []  0'0  0.000000
>>> Sigh.. this is exactly the problem I noticed last week that prompted the
>>> 'localized pgs' email.  I'm working on patches to remove this
>>> functionality entirely as we speak, since it's a broken design in several
>>> ways.
>>>
>>> Once you restart your osds (be sure to control-c that 'ceph' command
>>> first), you should be able to wipe out those bad pgs with
>>>
>>> 	ceph osd pool disable_lpgs --yes-i-really-mean-it
>>>
>>> This is in current master, and will be included in 0.46.
>> Have upgraded to 0.46 and done disable_lpgs on all pools - still no banana.
>> Other ideas?
> The stray pg entries will get zapped by v0.47.  The 'disable_lpgs' will
> only prevent new ones from getting created <0.47.
>
> sage

0.47 on board and the problem has gone!

>
>
>>> Alternatively, you can just ignore those pgs for the time being.  They
>>> are completely harmless as long as you can tolerate the ceph -s/ceph
>>> health noise.
>>> v0.47 should silently zap them all on upgrade.
>>>
>>> sage
>>>
>>>> My understanding is that p3 means the PG is being created on osd.3. I have
>>>> tried to stop and restart osd.3 - no banana. So I went for a more dramatic
>>>> option: lose osd.3. I completely destroyed osd.3 and rebuilt it from
>>>> scratch. osd.3 came back with exactly the same PGs in creating mode - which
>>>> makes me think that osd.3 is not responsible for this.
>>>>
>>>> [root@gamma tmp]# ceph osd dump
>>>> dumped osdmap epoch 6022
>>>> epoch 6022
>>>> fsid 7719f573-4c48-4852-a27f-51c7a3fe1c1e
>>>> created 2012-03-31 04:47:12.130128
>>>> modifed 2012-04-20 12:26:56.406193
>>>> flags
>>>>
>>>> pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 lpg_num 2 lpgp_num 2 last_change 1137 owner 0 crash_replay_interval 45
>>>> pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 192 pgp_num 192 lpg_num 2 lpgp_num 2 last_change 1160 owner 0
>>>> pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 192 pgp_num 192 lpg_num 2 lpgp_num 2 last_change 1141 owner 0
>>>>
>>>> max_osd 6
>>>> osd.0 up in weight 1 up_from 6001 up_thru 6018 down_at 6000 last_clean_interval [5980,5996) 172.16.64.200:6804/23046 172.16.64.200:6805/23046 172.16.64.200:6806/23046 exists,up
>>>> osd.1 up in weight 1 up_from 5998 up_thru 6017 down_at 5997 last_clean_interval [5992,5996) 172.16.64.201:6804/27598 172.16.64.201:6805/27598 172.16.64.201:6806/27598 exists,up
>>>> osd.2 up in weight 1 up_from 5998 up_thru 6019 down_at 5997 last_clean_interval [5978,5996) 172.16.64.202:6804/21457 172.16.64.202:6805/21457 172.16.64.202:6806/21457 exists,up
>>>> osd.3 up in weight 1 up_from 5972 up_thru 6017 down_at 5970 last_clean_interval [5884,5969) lost_at 1163 172.16.64.203:6800/10614 172.16.64.203:6801/10614 172.16.64.203:6802/10614 exists,up
>>>> osd.4 up in weight 1 up_from 5995 up_thru 6017 down_at 5988 last_clean_interval [5898,5987) 172.16.64.204:6800/16357 172.16.64.204:6801/16357 172.16.64.204:6802/16357 exists,up
>>>> osd.5 up in weight 1 up_from 5984 up_thru 6021 down_at 5982 last_clean_interval [5921,5981) 172.16.64.205:6800/11346 172.16.64.205:6801/11346 172.16.64.205:6803/11346 exists,up
>>>>
>>>> pg_temp 0.25 [0,1,4]
>>>> pg_temp 0.33 [0,1,5]
>>>> pg_temp 0.3e [1,0,4]
>>>> pg_temp 0.48 [1,0,4]
>>>> pg_temp 1.24 [0,1,4]
>>>> pg_temp 1.32 [0,1,5]
>>>> pg_temp 1.3d [1,0,4]
>>>> pg_temp 1.47 [1,0,2]
>>>> pg_temp 2.23 [0,1,4]
>>>> pg_temp 2.31 [0,1,5]
>>>> pg_temp 2.3c [1,0,4]
>>>> pg_temp 2.46 [1,0,4]
>>>> blacklist 172.16.64.202:6800/19544 expires 2012-04-20 12:45:43.842394
>>>> blacklist 172.16.64.200:6800/22946 expires 2012-04-20 12:50:17.341910
>>>> blacklist 172.16.64.201:6803/27504 expires 2012-04-20 12:50:20.102205
>>>>
>>>> Then I attempted to find out what's going on with 2.1p3:
>>>>
>>>> [root@delta ceph]# ceph pg 2.1p3 query
>>>>
>>>> At this point nothing happened - it just sat there.
>>>> At least it seemed so, but in reality:
>>>>
>>>> [root@delta ceph]# ceph -s
>>>> 2012-04-20 12:49:50.322103    pg v625936: 600 pgs: 2 creating, 4 stale, 400 active+degraded, 3 stale+peering, 174 stale+active+degraded, 1 down+remapped+peering, 4 active+degraded+remapped, 1 stale+remapped+peering, 3 stale+active+replay+degraded, 8 stale+active+degraded+remapped; 253 GB data, 768 GB used, 1292 GB / 2145 GB avail; 138289/230205 degraded (60.072%)
>>>> 2012-04-20 12:49:50.326769   mds e457: 1/1/1 up {0=2=up:active}, 2 up:standby
>>>> 2012-04-20 12:49:50.326886   osd e6042: 6 osds: 1 up, 6 in
>>>> 2012-04-20 12:49:50.327226   log 2012-04-20 12:49:16.732954 mon.0 172.16.64.200:6789/0 35 : [INF] osd.4 172.16.64.204:6800/16357 failed (by osd.5 172.16.64.205:6800/11346)
>>>> 2012-04-20 12:49:50.327497   mon e1: 3 mons at {0=172.16.64.200:6789/0,1=172.16.64.201:6789/0,2=172.16.64.202:6789/0}
>>>>
>>>> So as you can see, 5 out of 6 OSDs crashed. Restarting all crashed OSDs
>>>> manually gets ceph back on its feet without any damage - a minute later
>>>> everything runs as if nothing happened, and it is a good time to look at
>>>> the logs. In particular the osd.3 log:
>>>>
>>>> 2012-04-20 12:47:38.944653 aadd6b40 --OSD::tracker-- seq: 56, time: 2012-04-20 12:47:38.944641, event: started pg_info(1 pgs e6026) v1
>>>> 2012-04-20 12:47:38.965044 aadd6b40 --OSD::tracker-- seq: 57, time: 2012-04-20 12:47:38.965017, event: waiting_for_osdmap pg_info(1 pgs e6025) v1
>>>> 2012-04-20 12:47:38.965098 aadd6b40 --OSD::tracker-- seq: 57, time: 2012-04-20 12:47:38.965086, event: started pg_info(1 pgs e6025) v1
>>>> 2012-04-20 12:47:38.965179 aadd6b40 --OSD::tracker-- seq: 58, time: 2012-04-20 12:47:38.965165, event: waiting_for_osdmap pg_log(1.8 epoch 6026 query_epoch 6026) v2
>>>> 2012-04-20 12:47:38.965215 aadd6b40 --OSD::tracker-- seq: 58, time: 2012-04-20 12:47:38.965202, event: started pg_log(1.8 epoch 6026 query_epoch 6026) v2
>>>> 2012-04-20 12:47:38.965421 aadd6b40 --OSD::tracker-- seq: 59, time: 2012-04-20 12:47:38.965408, event: waiting_for_osdmap pg_info(1 pgs e6026) v1
>>>> 2012-04-20 12:47:38.965461 aadd6b40 --OSD::tracker-- seq: 59, time: 2012-04-20 12:47:38.965449, event: started pg_info(1 pgs e6026) v1
>>>> 2012-04-20 12:47:38.990427 ab5d7b40 --OSD::tracker-- seq: 60, time: 2012-04-20 12:47:38.990388, event: waiting_for_osdmap osd_pg_create(pg1.1p3,619; pg2.1p3,1141; ) v2
>>>> 2012-04-20 12:47:38.990592 aadd6b40 --OSD::tracker-- seq: 61, time: 2012-04-20 12:47:38.990579, event: waiting_for_osdmap pg_info(1 pgs e6026) v1
>>>> 2012-04-20 12:47:38.994296 aadd6b40 --OSD::tracker-- seq: 61, time: 2012-04-20 12:47:38.993072, event: started pg_info(1 pgs e6026) v1
>>>> 2012-04-20 12:47:39.077303 ab5d7b40 --OSD::tracker-- seq: 60, time: 2012-04-20 12:47:39.077276, event: started osd_pg_create(pg1.1p3,619; pg2.1p3,1141; ) v2
>>>> 2012-04-20 12:47:54.546810 ab5d7b40 --OSD::tracker-- seq: 62, time: 2012-04-20 12:47:54.546770, event: waiting_for_osdmap osd_op(client.13891.0:90331 rb.0.1.000000000b40 [write 3661824~4096] 2.1f179073) v4
>>>> osd/OSD.cc: In function 'PG* OSD::_lookup_lock_pg(pg_t)' thread a65cdb40 time 2012-04-20 12:47:54.562370
>>>> osd/OSD.cc: 1162: FAILED assert(pg_map.count(pgid))
>>>> ceph version 0.45 (commit:0aea1cb1df5c3e5ab783ca6f2ed7649823b613c5)
>>>> 1: /usr/bin/ceph-osd() [0x81f3b04]
>>>> 2: (OSD::do_command(Connection*, unsigned long long,
>>>> std::vector<std::string, std::allocator<std::string> >&,
>>>> ceph::buffer::list&)+0x1381) [0x8202b21]
>>>> 3: (OSD::CommandWQ::_process(OSD::Command*)+0x51) [0x823ac71]
>>>> 4: (ThreadPool::WorkQueue::_void_process(void*)+0x18) [0x8224758]
>>>> 5: (ThreadPool::worker()+0x42d) [0x82f8ddd]
>>>> 6: (ThreadPool::WorkThread::entry()+0x1f) [0x822559f]
>>>> 7: (Thread::_entry_func(void*)+0x10) [0x82f3140]
>>>> 8: (()+0x6cd3) [0x6e0cd3]
>>>> 9: (clone()+0x5e) [0xa3ca2e]
>>>> ceph version 0.45 (commit:0aea1cb1df5c3e5ab783ca6f2ed7649823b613c5)
>>>> 1: /usr/bin/ceph-osd() [0x81f3b04]
>>>> 2: (OSD::do_command(Connection*, unsigned long long,
>>>> std::vector<std::string, std::allocator<std::string> >&,
>>>> ceph::buffer::list&)+0x1381) [0x8202b21]
>>>> 3: (OSD::CommandWQ::_process(OSD::Command*)+0x51) [0x823ac71]
>>>> 4: (ThreadPool::WorkQueue::_void_process(void*)+0x18) [0x8224758]
>>>> 5: (ThreadPool::worker()+0x42d) [0x82f8ddd]
>>>> 6: (ThreadPool::WorkThread::entry()+0x1f) [0x822559f]
>>>> 7: (Thread::_entry_func(void*)+0x10) [0x82f3140]
>>>> 8: (()+0x6cd3) [0x6e0cd3]
>>>> 9: (clone()+0x5e) [0xa3ca2e]
>>>> *** Caught signal (Aborted) **
>>>> in thread a65cdb40
>>>> ceph version 0.45 (commit:0aea1cb1df5c3e5ab783ca6f2ed7649823b613c5)
>>>> 1: /usr/bin/ceph-osd() [0x83754db]
>>>> 2: [0x8c0400]
>>>> 3: [0x8c0416]
>>>> 4: (gsignal()+0x4f) [0x97898f]
>>>> 5: (abort()+0x175) [0x97a2d5]
>>>> 6: (__gnu_cxx::__verbose_terminate_handler()+0x14d) [0x7a754d]
>>>> 7: (()+0xb02c4) [0x7a52c4]
>>>> 8: (()+0xb0300) [0x7a5300]
>>>> 9: (()+0xb044f) [0x7a544f]
>>>> 10: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>> const*)+0x255) [0x8300715]
>>>> 11: /usr/bin/ceph-osd() [0x81f3b04]
>>>> 12: (OSD::do_command(Connection*, unsigned long long,
>>>> std::vector<std::string, std::allocator<std::string> >&,
>>>> ceph::buffer::list&)+0x1381) [0x8202b21]
>>>> 13: (OSD::CommandWQ::_process(OSD::Command*)+0x51) [0x823ac71]
>>>> 14: (ThreadPool::WorkQueue::_void_process(void*)+0x18) [0x8224758]
>>>> 15: (ThreadPool::worker()+0x42d) [0x82f8ddd]
>>>> 16: (ThreadPool::WorkThread::entry()+0x1f) [0x822559f]
>>>> 17: (Thread::_entry_func(void*)+0x10) [0x82f3140]
>>>> 18: (()+0x6cd3) [0x6e0cd3]
>>>> 19: (clone()+0x5e) [0xa3ca2e]
>>>>
>>>> So there are a couple of things to note:
>>>> 1. At 2012-04-20 12:47:39.077303 the osd started creating the PGs in question.
>>>> 2. At 2012-04-20 12:47:54.546810 we got FAILED assert(pg_map.count(pgid)).
>>>>
>>>> And at this point I guess it is over to you guys, as I do not have any idea
>>>> why this assert fails. If you point me in the right direction I will be
>>>> eternally grateful! :)
>>>>
>>>> Regards,
>>>> Vladimir
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
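
The abort in the quoted backtrace fires inside OSD::_lookup_lock_pg(), which asserts that the requested pgid is already registered in the OSD's in-memory pg_map before handing the PG back. Below is a minimal standalone sketch of that lookup pattern; it is illustrative only and not the actual osd/OSD.cc code (pg_t, PG and lookup_lock_pg() here are simplified stand-ins), but it shows how asking the OSD about a pgid it never instantiated - such as a stray localized PG like 2.1p3 - would trip exactly this kind of assert.

    // Illustrative sketch only: simplified stand-ins for Ceph's pg_t, PG and
    // pg_map, not the real osd/OSD.cc implementation.
    #include <cassert>
    #include <iostream>
    #include <map>

    struct pg_t {                        // hypothetical stand-in for Ceph's pg_t
      int pool;
      int seed;
      bool operator<(const pg_t &o) const {
        return pool < o.pool || (pool == o.pool && seed < o.seed);
      }
    };

    struct PG { /* placement group state would live here */ };

    static std::map<pg_t, PG *> pg_map;  // PGs this OSD has actually instantiated

    // Assert-on-missing lookup: a request naming a pgid that was never created
    // on this OSD aborts the process, matching "FAILED assert(pg_map.count(pgid))".
    PG *lookup_lock_pg(pg_t pgid) {
      assert(pg_map.count(pgid));        // fails if the PG does not exist here
      return pg_map[pgid];
    }

    int main() {
      pg_t known{2, 0x1f};
      pg_map[known] = new PG;            // one PG the OSD does know about

      lookup_lock_pg(known);             // fine: pgid is present in pg_map
      std::cout << "known pg found" << std::endl;

      pg_t stray{2, 1};                  // think "2.1p3": never instantiated here
      lookup_lock_pg(stray);             // aborts: assert(pg_map.count(pgid)) fails
    }

Read that way, the trace is consistent with the 'ceph pg 2.1p3 query' being handled in OSD::do_command() and looking up a localized pgid that appears in the cluster's bookkeeping but has no entry in that OSD's pg_map, hence the daemon-wide aborts observed above.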