From: "Kamble, Nitin A" <Nitin.Kamble@Teradata.com>
To: Sage Weil <sage@newdream.net>
Cc: Somnath Roy <Somnath.Roy@sandisk.com>,
	Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: Bluestore OSD support in ceph-disk
Date: Sun, 18 Sep 2016 06:41:28 +0000
Message-ID: <986F2B7B-C38D-4B9D-8E33-561DADC58B51@Teradata.com>
In-Reply-To: <alpine.DEB.2.11.1609171412030.1040@piezo.us.to>


> On Sep 17, 2016, at 7:14 AM, Sage Weil <sage@newdream.net> wrote:
> 
> On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
>>> On Sep 16, 2016, at 1:54 PM, Sage Weil <sage@newdream.net> wrote:
>>> 
>>> On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
>>>>> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>>> 
>>>>> How did you configure bluestore? All defaults, i.e. all in a single partition, no separate partition for db/wal?
>>>> It has separate partitions for data (SSD), wal (SSD), rocksdb (SSD), & block store (HDD).
>>>> 
>>>>> Wondering if you are out of db space/disk space ?
>>>> I notice a misconfiguration on the cluster now. The wal & db partitions got swapped, so it is getting just a 128MB db partition now. This is probably the cause of the assert.
>>> 
>>> FWIW bluefs is supposed to fall back on any allocation failure to the next 
>>> larger/slower device (wal -> db -> primary), so having a tiny wal or tiny 
>>> db shouldn't actually matter.  A bluefs log (debug bluefs = 10 or 20) 
>>> leading up to any crash there would be helpful.
>>> 
>>> Thanks!
>>> sage
>>> 
>> 
>> Good to know about this fallback mechanism.
>> In my previous run the partitions and the sizes in the config did not match. I saw the ceph daemon dump
>> showing 900MB+ used for db while the db partition was 128MB. I was thinking it had started overwriting
>> the next partition, but instead, per this fallback logic, it started using the HDD for db.
>> 
>> One issue I see is that ceph.conf lists the sizes of the wal, db, & block, but the actual partitions may
>> have different sizes. From the ceph daemon dump output it looks like the code is not looking at the
>> partition's real size; instead it assumes the sizes from the config file are the partition sizes. I think
>> probing the size of the existing devices/files would be better than blindly taking the sizes from the
>> config file.
> 
> Those bluestore_block*_size options are only used in conjunction with 
> bluestore_block*_create = true to create a dummy block device file for 
> debugging/testing purposes.

So if I set these config options to false in ceph.conf, then the default sizes should be ignored, right?
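For my own notes, this is roughly what I intend to have in ceph.conf for these OSDs. The *_create and *_size options are the ones discussed above; the *_path option names and the device paths below are only my assumption of how to point at the real partitions, not something I have verified on this build yet:

  [osd]
  # do not create dummy file-backed block devices; the partitions already exist
  bluestore_block_create = false
  bluestore_block_db_create = false
  bluestore_block_wal_create = false
  # example device paths only; my actual layout differs per node
  bluestore_block_path = /dev/sdb2
  bluestore_block_db_path = /dev/sda3
  bluestore_block_wal_path = /dev/sda2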

> 
> Are you referring to the perf counters (ceph daemon osd.N perf dump) 
> figures?  Can you share the numbers you saw?  It sounds like there might 
> be a bug in the instrumentation.

Yes, I referred to the perf counter you mentioned at the beginning of this thread.
Unfortunately I am resurrecting the cluster frequently to get some performance data out, so
I can't show you the numbers right now. I will try to capture them next time.
In the first attempt, where I messed up the partition links, I noticed that the perf stat
for db was showing the size from the config file, which was much bigger (37GB) than the partition's
actual size (128MB). There is a possibility I made a mistake in identifying the partitions correctly,
as I was focusing on getting the setup working more than on looking for possible bugs in the implementation.
But it will be easy and quick to just try it out and verify which partition size is shown in the perf stat.
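Concretely, next time I will grab the counters with something like the following; the bluefs counter names I list here are from memory, so treat them as an assumption until I can paste real output:

  # dump the admin-socket perf counters for one OSD
  ceph daemon osd.3 perf dump > osd.3-perf.json
  # the bluefs section should report the totals/used I was describing, e.g.
  #   "bluefs": { "db_total_bytes": ..., "db_used_bytes": ...,
  #               "wal_total_bytes": ..., "wal_used_bytes": ... }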

BTW, in another run a few OSDs died with the following errors. This time the partition layout was not messed up.

osd.2 

2016-09-17 02:36:18.460357 7f1fd38c9700 -1 *** Caught signal (Aborted) **
 in thread 7f1fd38c9700 thread_name:tp_osd_tp

 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (()+0x892dd2) [0x7f1ff24d9dd2]
 2: (()+0xf890) [0x7f1fee6cf890]
 3: (pthread_cond_wait()+0xbf) [0x7f1fee6cc05f]
 4: (Throttle::_wait(long)+0x1a4) [0x7f1ff2651994]
 5: (Throttle::get(long, long)+0xc7) [0x7f1ff26525f7]
 6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x8a3) [0x7f1ff2407ad3]
 7: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x7c) [0x7f1ff21ec93c]
 8: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xb8f) [0x7f1ff22bcf0f]
 9: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x3b3) [0x7f1ff22bd883]
 10: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xfa) [0x7f1ff218c8da]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x7f1ff2045905]
 12: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x5d) [0x7f1ff2045b2d]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x874) [0x7f1ff2066e84]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x947) [0x7f1ff2660a37]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f1ff2662b90]
 16: (()+0x80a4) [0x7f1fee6c80a4]
 17: (clone()+0x6d) [0x7f1fed54104d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


osd.3


2016-09-16 23:10:39.690923 7f1bb9c6f700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vector<bluefs_extent_t>*)' thread 7f1bb9c6f700 time 2016-09-16 23:10:39.686574
/build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")

 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f1be2a1511b]
 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7f1be284e5dd]
 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7f1be2855a1f]
 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7f1be2856c9b]
 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7f1be286c25e]
 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7f1be2963699]
 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7f1be2964238]
 8: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*)+0x8c3) [0x7f1be29a6fa3]
 9: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0xb19) [0x7f1be29a8359]
 10: (rocksdb::CompactionJob::Run()+0x1cf) [0x7f1be29a990f]
 11: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0xa0c) [0x7f1be28aaf4c]
 12: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0x27d) [0x7f1be28ba18d]
 13: (rocksdb::ThreadPoolImpl::BGThread(unsigned long)+0x1a1) [0x7f1be296b8a1]
 14: (()+0x96a983) [0x7f1be296b983]
 15: (()+0x80a4) [0x7f1bdea820a4]
 16: (clone()+0x6d) [0x7f1bdd8fb04d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
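
To rule out another partition mix-up on these nodes, I will also verify the links and the real partition sizes directly, along these lines. The /var/lib/ceph/osd path, the block/block.db/block.wal symlink names, and the device names are what I expect ceph-disk to have set up, so treat them as assumptions:

  # which partitions is the OSD actually pointing at?
  ls -l /var/lib/ceph/osd/ceph-3/block /var/lib/ceph/osd/ceph-3/block.db /var/lib/ceph/osd/ceph-3/block.wal
  # real sizes in bytes, independent of anything in ceph.conf
  blockdev --getsize64 /dev/sda2   # wal (example device)
  blockdev --getsize64 /dev/sda3   # db  (example device)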


Also, I noticed 3 more OSDs on different nodes were in the down state without any errors in their logs.
The ceph-osd processes for these OSDs were running and consuming 100% CPU, while the other
well-behaving OSDs were in the up state and using 1 to 3% CPU.
I checked the perf stat information and noticed all three had the WAL (128MB) fully consumed, with 0 bytes free.
I have resurrected the setup since then, and I am currently focusing on getting some
performance numbers for the latest bluestore within a constrained time limit. Once I am done with that,
I will help out more with gathering more detailed data, and with some debugging efforts as well.
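For the debugging pass, I plan to turn on the verbose bluefs logging you asked for, roughly like this. The ceph.conf form is what I intend to deploy before restarting the OSDs; the runtime injection is a fallback I believe should work, but I have not tried it on this build yet:

  # ceph.conf, [osd] section, before restarting the OSDs
  debug bluefs = 20/20
  # or inject into the still-running OSDs at runtime
  ceph tell osd.* injectargs '--debug-bluefs 20/20'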

Thanks,
Nitin


> 
>> After 5 hours or so, 6+ OSDs were down out of 30.
>> We will be running the stress test once again with the fixed partition configuration at debug level
>> 0, to get maximum performance. If that fails, then I will switch to debug level 10 or 20 and gather
>> some detailed logs.
> 
> Any backtraces you see would be helpful.
> 
> Thanks!
> sage
> 
> 
>> 
>> Thanks,
>> Nitin
>> 
>>> 
>>>>> We had some issues on this front some time back which were fixed; maybe a new issue (?). Need verbose logs for at least bluefs (debug_bluefs = 20/20)
>>>> 
>>>> Let me fix the cluster configuration to give more space to the DB partition. If the issue still comes up after that, then I will try capturing detailed logs.
>>>> 
>>>>> BTW, what is your workload (block size, IO pattern)?
>>>> 
>>>> The workload is an internal Teradata benchmark, which simulates the IO pattern of database disk access with various block sizes and IO patterns.
>>>> 
>>>> Thanks,
>>>> Nitin
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
>>>>> Sent: Friday, September 16, 2016 12:00 PM
>>>>> To: Somnath Roy
>>>>> Cc: Sage Weil; Ceph Development
>>>>> Subject: Re: Bluestore OSD support in ceph-disk
>>>>> 
>>>>> 
>>>>>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>>>> 
>>>>>> Please send the snippet (the very first trace; go up in the log) where it is actually printing the assert.
>>>>>> BTW, what workload are you running?
>>>>>> 
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>> 
>>>>> Here it is.
>>>>> 
>>>>> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vector<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
>>>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")
>>>>> 
>>>>> ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
>>>>> 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
>>>>> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
>>>>> 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
>>>>> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
>>>>> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
>>>>> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
>>>>> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
>>>>> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
>>>>> 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
>>>>> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
>>>>> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
>>>>> 13: (()+0x80a4) [0x7fb5bb4a70a4]
>>>>> 14: (clone()+0x6d) [0x7fb5ba32004d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Nitin
>>>>> 
>>>> 
>> 

