From: Varada Kari
Subject: Re: Bluestore assert
Date: Mon, 22 Aug 2016 08:02:20 +0000
To: Somnath Roy <Somnath.Roy@sandisk.com>, Sage Weil
Cc: Mark Nelson, ceph-devel

Ideally, we should have enough space already allocated for the log, but
here we are trying to grow the log file, hence the assert.
Are there any relevant log messages about allocation in the log from
BlueFS?

Varada

On Monday 22 August 2016 01:04 PM, Somnath Roy wrote:
> Sage,
> Got the following asserts on two different paths with the latest master.
>
> 1.
> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
>
> ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55f0d46f9cd0]
> 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1bed) [0x55f0d43cb34d]
> 3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55f0d43cb467]
> 4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x3b2) [0x55f0d43ccf12]
> 5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x35b) [0x55f0d43ce2fb]
> 6: (BlueRocksWritableFile::Sync()+0x62) [0x55f0d43e5c32]
> 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) [0x55f0d456f4f1]
> 8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x55f0d45709a0]
> 9: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) [0x55f0d45b2506]
> 10: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x14ea) [0x55f0d45b4cca]
> 11: (rocksdb::CompactionJob::Run()+0x479) [0x55f0d45b5c49]
> 12: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55f0d44a4610]
> 13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55f0d44b147f]
> 14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55f0d4568079]
> 15: (()+0x98f113) [0x55f0d4568113]
> 16: (()+0x76fa) [0x7f06101576fa]
> 17: (clone()+0x6d) [0x7f060dfb7b5d]
>
>
> 2.
>
> 5700 time 2016-08-21 23:15:50.962450
> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
>
> ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d9959bfcd0]
> 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1bed) [0x55d99569134d]
> 3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55d995691467]
> 4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x3b2) [0x55d995692f12]
> 5: (BlueFS::sync_metadata()+0x1c3) [0x55d995697c33]
> 6: (BlueRocksDirectory::Fsync()+0xd) [0x55d9956ab98d]
> 7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13fa) [0x55d995778c2a]
> 8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x2a) [0x55d9957797aa]
> 9: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b) [0x55d99573cb5b]
> 10: (BlueStore::_kv_sync_thread()+0x1745) [0x55d995589e65]
> 11: (BlueStore::KVSyncThread::entry()+0xd) [0x55d9955b754d]
> 12: (Thread::entry_wrapper()+0x75) [0x55d99599f5e5]
> 13: (()+0x76fa) [0x7f8a0bdde6fa]
> 14: (clone()+0x6d) [0x7f8a09c3eb5d]
>
>
> I saw this assert is newly introduced in the code.
> FYI, I was running rocksdb with universal style compaction enabled during this time.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Tuesday, August 16, 2016 12:45 PM
> To: Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> Sage,
> The replay bug *is fixed* with your patch. I am able to bring the OSDs (and cluster) up after hitting the db assertion bug.
> Presently, I am trying to root-cause the db assertion issue.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 15, 2016 12:54 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> On Sun, 14 Aug 2016, Somnath Roy wrote:
>> Sage,
>> I did this..
>>
>> root@emsnode5:~/ceph-master/src# git diff
>> diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc
>> index 638d231..bcf0935 100644
>> --- a/src/kv/RocksDBStore.cc
>> +++ b/src/kv/RocksDBStore.cc
>> @@ -370,6 +370,10 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>    logger->inc(l_rocksdb_txns);
>>    logger->tinc(l_rocksdb_submit_latency, lat);
>> +  if (!s.ok()) {
>> +    derr << __func__ << " error: " << s.ToString()
>> +         << "code = " << s.code() << dendl;
>> +  }
>>    return s.ok() ? 0 : -1;
>>  }
>>
>> @@ -385,6 +389,11 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>    logger->inc(l_rocksdb_txns_sync);
>>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
>> +  if (!s.ok()) {
>> +    derr << __func__ << " error: " << s.ToString()
>> +         << "code = " << s.code() << dendl;
>> +  }
>> +
>>    return s.ok() ? 0 : -1;
>>  }
>>
>>  int RocksDBStore::get_info_log_level(string info_log_level)
>> @@ -442,7 +451,8 @@ void RocksDBStore::RocksDBTransactionImpl::rmkey(const string &prefix,
>>  void RocksDBStore::RocksDBTransactionImpl::rm_single_key(const string &prefix,
>>                                                           const string &k)
>>  {
>> -  bat->SingleDelete(combine_strings(prefix, k));
>> +  //bat->SingleDelete(combine_strings(prefix, k));
>> +  bat->Delete(combine_strings(prefix, k));
>>  }
>>
>> But, the db crash is still happening with the following log message.
>>
>> rocksdb: submit_transaction_sync error: NotFound: code = 1
>>
>> It seems it is not related to rm_single_key, as I am hitting this from https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5108 as well, where rm_single_key is not called.
>> Maybe I should dump the transaction and see what's in there?
> Yeah. Unfortunately I think it isn't trivial to dump the kv transactions because they're being constructed by rocksdb (WriteBatch or something).
> Not sure if there is a dump for that (I'm guessing not?). You'd need to write one, or build a kludgey lookaside map that can be dumped.
>
>> I am hitting the BlueFS replay bug I mentioned earlier and applied your patch (https://github.com/ceph/ceph/pull/10686), but it is not helping.
>> Is it because I needed to run with this patch from the beginning and not just during replay?
> Yeah, the bug happens before replay.. we are writing a bad entry into the bluefs log.
>
> sage
>
>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sweil@redhat.com]
>> Sent: Thursday, August 11, 2016 3:32 PM
>> To: Somnath Roy
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Bluestore assert
>>
>> On Thu, 11 Aug 2016, Somnath Roy wrote:
>>> Sage,
>>> Regarding the db assert, I hit that again on multiple OSDs while I was populating 40TB rbd images (~35TB written before the crash).
>>> I did the following changes in the code..
>>>
>>> @@ -370,7 +370,7 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>    logger->inc(l_rocksdb_txns);
>>>    logger->tinc(l_rocksdb_submit_latency, lat);
>>> -  return s.ok() ? 0 : -1;
>>> +  return s.ok() ? 0 : -s.code();
>>>  }
>>>
>>>  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>>> @@ -385,7 +385,7 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>    logger->inc(l_rocksdb_txns_sync);
>>>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
>>> -  return s.ok() ? 0 : -1;
>>> +  return s.ok() ? 0 : -s.code();
>>>  }
>>>
>>>  int RocksDBStore::get_info_log_level(string info_log_level) {
>>>
>>> diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
>>> index fe7f743..3f4ecd5 100644
>>> --- a/src/os/bluestore/BlueStore.cc
>>> +++ b/src/os/bluestore/BlueStore.cc
>>> @@ -4989,6 +4989,9 @@ void BlueStore::_kv_sync_thread()
>>>            ++it) {
>>>          _txc_finalize_kv((*it), (*it)->t);
>>>          int r = db->submit_transaction((*it)->t);
>>> +        if (r < 0) {
>>> +          dout(0) << "submit_transaction returned = " << r << dendl;
>>> +        }
>>>          assert(r == 0);
>>>        }
>>>      }
>>> @@ -5026,6 +5029,10 @@ void BlueStore::_kv_sync_thread()
>>>          t->rm_single_key(PREFIX_WAL, key);
>>>        }
>>>        int r = db->submit_transaction_sync(t);
>>> +      if (r < 0) {
>>> +        dout(0) << "submit_transaction_sync returned = " << r << dendl;
>>> +      }
>>> +
>>>        assert(r == 0);
>>>
>>>
>>> This is printing -1 in the log before the assert. So, the corresponding code from the rocksdb side is "kNotFound".
>>> It is not related to space, as I hit this same issue irrespective of whether the db partition size is 100G or 300G.
>>> It seems like some kind of corruption within BlueStore?
>>> Let me know the next step.
>> Can you add this too?
>>
>> diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc
>> index 638d231..b5467f7 100644
>> --- a/src/kv/RocksDBStore.cc
>> +++ b/src/kv/RocksDBStore.cc
>> @@ -370,6 +370,9 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>    logger->inc(l_rocksdb_txns);
>>    logger->tinc(l_rocksdb_submit_latency, lat);
>> +  if (!s.ok()) {
>> +    derr << __func__ << " error: " << s.ToString() << dendl;
>> +  }
>>    return s.ok() ? 0 : -1;
>>  }
>>
>> It's not obvious to me how we would get NotFound when doing a Write into the kv store.
>>
>> Thanks!
>> sage
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sweil@redhat.com]
>>> Sent: Thursday, August 11, 2016 9:36 AM
>>> To: Mark Nelson
>>> Cc: Somnath Roy; ceph-devel
>>> Subject: Re: Bluestore assert
>>>
>>> On Thu, 11 Aug 2016, Mark Nelson wrote:
>>>> Sorry if I missed this during discussion, but why are these being
>>>> called if the file is deleted?
>>> I'm not sure... rocksdb is the one consuming the interface. Looking through the code, though, this is the only way I can see that we could log an op_file_update *after* an op_file_remove.
>>>
>>> sage
>>>
>>>> Mark
>>>>
>>>> On 08/11/2016 11:29 AM, Sage Weil wrote:
>>>>> On Thu, 11 Aug 2016, Somnath Roy wrote:
>>>>>> Sage,
>>>>>> Please find the full log for the BlueFS replay bug in the
>>>>>> following location.
>>>>>>
>>>>>> https://github.com/somnathr/ceph/blob/master/ceph-osd.1.log.zip
>>>>>>
>>>>>> For the db transaction one, I have added code to dump the
>>>>>> rocksdb error code before the assert as you suggested, and am waiting to reproduce.
>>>>> I'm pretty sure this is the root cause:
>>>>>
>>>>> https://github.com/ceph/ceph/pull/10686
>>>>>
>>>>> sage
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html