All of lore.kernel.org
* Best latest commit to run bluestore
@ 2016-09-14 19:10 Kamble, Nitin A
  2016-09-14 19:13 ` Somnath Roy
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-14 19:10 UTC (permalink / raw)
  To: Ceph Development; +Cc: Somnath Roy, Sage Weil

Hi Sage, Somnath,

I would like to see how the current bluestore implementation performs under some workload.
I follow the ceph-devel mailing list, so I am aware of the memory leak issue currently present
in the bluestore code.

I also noticed that Somnath recently used the master branch to generate some benchmark
numbers and pretty graphs. I am not worried about losing data on the cluster at this point;
this is not a production setup, just an experiment to compare current bluestore performance
against filestore.

I am planning to build the ceph packages from the latest ceph master branch to do some
workload testing on KRBD images backed by bluestore. I understand that the master
branch is a work in progress, can break things, etc. :)
  Can you suggest the latest master commit that I should be able to use to stress bluestore
via krbd for 20+ hours?
  If nothing comes to mind, I can pick the commit Somnath used to generate the pretty graphs.

Thanks in advance for your help.
Nitin



^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Best latest commit to run bluestore
  2016-09-14 19:10 Best latest commit to run bluestore Kamble, Nitin A
@ 2016-09-14 19:13 ` Somnath Roy
  2016-09-14 19:47   ` Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2016-09-14 19:13 UTC (permalink / raw)
  To: Kamble, Nitin A, Ceph Development; +Cc: Sage Weil

Hi Nitin,
Try the latest master along with the latest rocksdb pull (it should now come in automatically via the submodule); so far it is looking good.
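
Something like this should pick up master together with the pinned rocksdb submodule (a minimal sketch, assuming a fresh clone of ceph.git):

  git clone https://github.com/ceph/ceph.git
  cd ceph
  git checkout master
  git submodule update --init --recursive   # pulls rocksdb at the pinned commit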

Thanks & Regards
Somnath

-----Original Message-----
From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
Sent: Wednesday, September 14, 2016 12:11 PM
To: Ceph Development
Cc: Somnath Roy; Sage Weil
Subject: Best latest commit to run bluestore

Hi Sage, Somnath,

I would like to see how the current bluestore implementation performs under some workload.
I follow the ceph-devel mailing list, so I am aware of the memory leak issue currently present in the bluestore code.

I also noticed that Somnath recently used the master branch to generate some benchmark numbers and pretty graphs. I am not worried about losing data on the cluster at this point; this is not a production setup, just an experiment to compare current bluestore performance against filestore.

I am planning to build the ceph packages from the latest ceph master branch to do some workload testing on KRBD images backed by bluestore. I understand that the master branch is a work in progress, can break things, etc. :)
  Can you suggest the latest master commit that I should be able to use to stress bluestore via krbd for 20+ hours?
   If nothing comes to mind, I can pick the commit Somnath used to generate the pretty graphs.

Thanks in advance for your help.
Nitin



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Best latest commit to run bluestore
  2016-09-14 19:13 ` Somnath Roy
@ 2016-09-14 19:47   ` Kamble, Nitin A
  2016-09-15 18:23     ` Bluestore OSD support in ceph-disk Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-14 19:47 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Ceph Development, Sage Weil


> On Sep 14, 2016, at 12:13 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> Hi Nitin,
> Try with latest master along with the latest rocksdb pull (now should be coming via submodule automatically) , so far it is looking good.
> 
> Thanks & Regards
> Somnath
> 

Hi Somnath,
    Thanks for the quick response.
Nitin

> -----Original Message-----
> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
> Sent: Wednesday, September 14, 2016 12:11 PM
> To: Ceph Development
> Cc: Somnath Roy; Sage Weil
> Subject: Best latest commit to run bluestore
> 
> Hi Sage, Somnath,
> 
> I would like to see how the current bluestore implementation performs with some workload.
> I am following the ceph-devel mailing list, so I am aware there is the memory leak issue in the bluestore code now.
> 
> I also notice that, Somnath has used the master branch recently to generate some benchmark numbers to generate pretty graphs. I do not worry about the loosing the data on the cluster at this point, this is not a production setup, just a experimentation to evaluate current bluestore performance vs filestore.
> 
> I am planning to build the ceph packages from the the latest ceph master branch to do some workload testing on KRBD images backed up by Bluestore. I understand that the master branch is work in progress, can break things,... etc. :)
>  Can you guys suggest me the best and the latest master commit which I should be able use to stress bluestore via krbd for 20+ hours?
>   If nothing comes up to mind, then I can pick the commit Somnath used to generate the pretty graphs.
> 
> Thanks in advance for your help.
> Nitin
> 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Bluestore OSD support in ceph-disk
  2016-09-14 19:47   ` Kamble, Nitin A
@ 2016-09-15 18:23     ` Kamble, Nitin A
  2016-09-15 18:34       ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-15 18:23 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Ceph Development, Sage Weil

Can I use ceph-disk to prepare a bluestore OSD now?

I would like to know the proper command-line parameters for ceph-disk.

The following related tracker issue has been closed; does that mean ceph-disk is ready to use for creating bluestore OSDs?

From: http://tracker.ceph.com/issues/13942

>
> Updated by Sage Weil 9 months ago
>	• Status changed from In Progress to Verified
>	• Assignee deleted (Loic Dachary)
> For the 'ceph-disk prepare' part, I think we should keep it simple initially:
>
> ceph-disk --osd-objectstore bluestore maindev[:dbdev[:waldev]]
> and teach ceph-disk how to do the partitioning for bluestore (no generic way to ask ceph-osd that). We can leave off the db/wal devices initially, and then make activate work, so that there is something functional. Then add dbdev and waldev support last.
>

Thanks,
Nitin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-15 18:23     ` Bluestore OSD support in ceph-disk Kamble, Nitin A
@ 2016-09-15 18:34       ` Sage Weil
  2016-09-15 18:46         ` Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-09-15 18:34 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Somnath Roy, Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1058 bytes --]

On Thu, 15 Sep 2016, Kamble, Nitin A wrote:
> Can I use ceph-disk to prepare a bluestore OSD now?
> 
> I would like to know proper command line parameters for ceph-disk .
> 
> The following related issue tracker has closed, does it mean it is ready 
> to use for creation of bluestore OSDs?
> 
> From: http://tracker.ceph.com/issues/13942
> 
> >
> > Updated by Sage Weil 9 months ago
> >	• Status changed from In Progress to Verified
> >	• Assignee deleted (Loic Dachary)
> > For the 'ceph-disk prepare' part, I think we should keep it simple initially:
> >
> > ceph-disk --osd-objectstore bluestore maindev[:dbdev[:waldev]]
> > and teach ceph-disk how to do the partitioning for bluestore (no generic way to ask ceph-osd that). We can leave off the db/wal devices initially, and then make activate work, so that there is something functional. Then add dbdev and waldev support last.

You just need to pass --bluestore to ceph-disk for a single-disk setup. 
The multi-device PR is still pending, but close:

	https://github.com/ceph/ceph/pull/10135
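
For the single-disk case, the prepare command would be something like the following (the device name is only an illustration):

  ceph-disk prepare --bluestore /dev/sdb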

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-15 18:34       ` Sage Weil
@ 2016-09-15 18:46         ` Kamble, Nitin A
  2016-09-15 18:54           ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-15 18:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Ceph Development


> On Sep 15, 2016, at 11:34 AM, Sage Weil <sage@newdream.net> wrote:
> 
> On Thu, 15 Sep 2016, Kamble, Nitin A wrote:
>> Can I use ceph-disk to prepare a bluestore OSD now?
>> 
>> I would like to know proper command line parameters for ceph-disk .
>> 
>> The following related issue tracker has closed, does it mean it is ready 
>> to use for creation of bluestore OSDs?
>> 
>> From: http://tracker.ceph.com/issues/13942
>> 
>>> 
>>> Updated by Sage Weil 9 months ago
>>> 	• Status changed from In Progress to Verified
>>> 	• Assignee deleted (Loic Dachary)
>>> For the 'ceph-disk prepare' part, I think we should keep it simple initially:
>>> 
>>> ceph-disk --osd-objectstore bluestore maindev[:dbdev[:waldev]]
>>> and teach ceph-disk how to do the partitioning for bluestore (no generic way to ask ceph-osd that). We can leave off the db/wal devices initially, and then make activate work, so that there is something functional. Then add dbdev and waldev support last.
> 
> You just need to pass --bluestore to ceph-disk for a single-disk setup. 
> The multi-device PR is still pending, but close:
> 
> 	https://github.com/ceph/ceph/pull/10135
> 
> sage
>
> sudo ceph-disk prepare --bluestore /dev/sdb --block.db /dev/sdc --block.wal /dev/sdc
> 
> This will create 2 partitions on sdb and 2 partitions on sdc, then 'block.db' will symlink to partition 1 of sdc, 'block.wal' will symlink to partition 2 of sdc, 'block' will symlink to partition 2 of sdb.
>

Nice! I am eager to see it land in the master branch and try it out. Is there an approximate ETA for merging this PR?

It looks like the WAL size is fixed at 128MB irrespective of the block store size. How should the metadata db size be determined? It needs to be proportional to the block store size. Is there any recommended ratio?



Thanks,
Nitin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-15 18:46         ` Kamble, Nitin A
@ 2016-09-15 18:54           ` Sage Weil
  2016-09-16  6:43             ` Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-09-15 18:54 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Somnath Roy, Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2820 bytes --]

On Thu, 15 Sep 2016, Kamble, Nitin A wrote:
> > On Sep 15, 2016, at 11:34 AM, Sage Weil <sage@newdream.net> wrote:
> > 
> > On Thu, 15 Sep 2016, Kamble, Nitin A wrote:
> >> Can I use ceph-disk to prepare a bluestore OSD now?
> >> 
> >> I would like to know proper command line parameters for ceph-disk .
> >> 
> >> The following related issue tracker has closed, does it mean it is ready 
> >> to use for creation of bluestore OSDs?
> >> 
> >> From: http://tracker.ceph.com/issues/13942
> >> 
> >>> 
> >>> Updated by Sage Weil 9 months ago
> >>> 	• Status changed from In Progress to Verified
> >>> 	• Assignee deleted (Loic Dachary)
> >>> For the 'ceph-disk prepare' part, I think we should keep it simple initially:
> >>> 
> >>> ceph-disk --osd-objectstore bluestore maindev[:dbdev[:waldev]]
> >>> and teach ceph-disk how to do the partitioning for bluestore (no generic way to ask ceph-osd that). We can leave off the db/wal devices initially, and then make activate work, so that there is something functional. Then add dbdev and waldev support last.
> > 
> > You just need to pass --bluestore to ceph-disk for a single-disk setup. 
> > The multi-device PR is still pending, but close:
> > 
> > 	https://github.com/ceph/ceph/pull/10135
> > 
> > sage
> >
> > sudo ceph-disk prepare --bluestore /dev/sdb --block.db /dev/sdc --block.wal /dev/sdc
> > 
> > This will create 2 partitions on sdb and 2 partitions on sdc, then 'block.db' will symlink to partition 1 of sdc, 'block.wal' will symlink to partition 2 of sdc, 'block' will symlink to partition 2 of sdb.
> >
> 
> Nice!  I am eager to see it go into the master branch, and try it out. 
> Any approximate ETA for merging this PR?
> 
> Looks like the WAL size is of fixed 128MB irrespective of the block 
> store size. How to determine the metadata db size? It needs to 
> proportional to the block store size. Is there any recommended ratio?

The 128MB figure is mostly pulled out of a hat.  I suspect it will be 
reasonable, but a proper recommendation is going to depend on how we end 
up tuning rocksdb, and we've put that off until the metadata format is 
finalized and any rocksdb tuning we do will be meaningful.  We're pretty 
much at that point now...

Whatever it is, it should be related to the request rate, and perhaps the 
relative speed of the wal device and the db or main device.  The size of 
the slower devices shouldn't matter, though.

There are some bluefs perf counters that let you monitor what the wal 
device utilization is.  See 

  b.add_u64(l_bluefs_wal_total_bytes, "wal_total_bytes",
	    "Total bytes (wal device)");
  b.add_u64(l_bluefs_wal_free_bytes, "wal_free_bytes",
	    "Free bytes (wal device)");

which you can monitor via 'ceph daemon osd.N perf dump'.  If you 
discover anything interesting, let us know!
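
For example, something like this should pull out the wal counters for one OSD (the osd id is only an illustration):

  ceph daemon osd.0 perf dump | grep -E 'wal_(total|free)_bytes'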

Thanks-
sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-15 18:54           ` Sage Weil
@ 2016-09-16  6:43             ` Kamble, Nitin A
  2016-09-16 18:38               ` Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-16  6:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Ceph Development


> On Sep 15, 2016, at 11:54 AM, Sage Weil <sage@newdream.net> wrote:
> 
> 
> The 128MB figure is mostly pulled out of a hat.  I suspect it will be 
> reasonable, but a proper recommendation is going to depend on how we end 
> up tuning rocksdb, and we've put that off until the metadata format is 
> finalized and any rocksdb tuning we do will be meaningful.  We're pretty 
> much at that point now...
> 
> Whatever it is, it should be related to the request rate, and perhaps the 
> relative speed of the wal device and the db or main device.  The size of 
> the slower devices shouldn't matter, though.
> 
> There are some bluefs perf counters that let you monitor what the wal 
> device utilization is.  See 
> 
>  b.add_u64(l_bluefs_wal_total_bytes, "wal_total_bytes",
> 	    "Total bytes (wal device)");
>  b.add_u64(l_bluefs_wal_free_bytes, "wal_free_bytes",
> 	    "Free bytes (wal device)");
> 
> which you can monitor via 'ceph daemon osd.N perf dump'.  If you 
> discover anything interesting, let us know!
> 
> Thanks-
> sage

I was able to build and deploy the latest master (commit: 9096ad37f2c0798c26d7784fb4e7a781feb72cb8) with partitioned bluestore. I struggled a bit to bring up the OSDs, as the available documentation for partitioned bluestore OSDs is still fairly primitive; once ceph-disk gets updated this pain will go away. We will stress the cluster shortly, but so far I am delighted to see that from ground zero it is able to stand on its own feet and reach HEALTH_OK without any errors. If I see any issues in our tests I will share them here.
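
For reference, the manual bring-up of each OSD went roughly along these lines (a simplified sketch; device names and the osd id are illustrative, and the data-partition mount, monitor-side osd creation, and keyring steps are omitted):

  # ceph.conf carries "osd objectstore = bluestore" for these OSDs
  mkdir -p /var/lib/ceph/osd/ceph-0
  ln -s /dev/sdb1 /var/lib/ceph/osd/ceph-0/block      # main block device (HDD)
  ln -s /dev/sdc1 /var/lib/ceph/osd/ceph-0/block.db   # rocksdb partition (SSD)
  ln -s /dev/sdc2 /var/lib/ceph/osd/ceph-0/block.wal  # wal partition (SSD)
  ceph-osd -i 0 --mkfs --mkkey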

Thanks,
Nitin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-16  6:43             ` Kamble, Nitin A
@ 2016-09-16 18:38               ` Kamble, Nitin A
  2016-09-16 18:43                 ` Somnath Roy
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-16 18:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Ceph Development


> On Sep 15, 2016, at 11:43 PM, Kamble, Nitin A <Nitin.Kamble@Teradata.com> wrote:
> 
>> 
>> On Sep 15, 2016, at 11:54 AM, Sage Weil <sage@newdream.net> wrote:
>> 
>> 
>> The 128MB figure is mostly pulled out of a hat.  I suspect it will be 
>> reasonable, but a proper recommendation is going to depend on how we end 
>> up tuning rocksdb, and we've put that off until the metadata format is 
>> finalized and any rocksdb tuning we do will be meaningful.  We're pretty 
>> much at that point now...
>> 
>> Whatever it is, it should be related to the request rate, and perhaps the 
>> relative speed of the wal device and the db or main device.  The size of 
>> the slower devices shouldn't matter, though.
>> 
>> There are some bluefs perf counters that let you monitor what the wal 
>> device utilization is.  See 
>> 
>> b.add_u64(l_bluefs_wal_total_bytes, "wal_total_bytes",
>> 	    "Total bytes (wal device)");
>> b.add_u64(l_bluefs_wal_free_bytes, "wal_free_bytes",
>> 	    "Free bytes (wal device)");
>> 
>> which you can monitor via 'ceph daemon osd.N perf dump'.  If you 
>> discover anything interesting, let us know!
>> 
>> Thanks-
>> sage
> 
> I could build and deploy the latest master (commit: 9096ad37f2c0798c26d7784fb4e7a781feb72cb8) with partitioned bluestore. I struggled a bit to bring up OSDs as the available documentation for bringing up the partitioned bluestore OSDs is mostly primitive so far. Once ceph-disk gets updated this pain will go away. We will stress the cluster shortly, but so far I am delighted to see that from ground-zero it is able stand up on it’s own feet to HEALTH_OK without any errors. If I see any issues in our tests I will share it here.
> 
> Thanks,
> Nitin

Out of 30 OSDs, one failed after about 1.5 hours of stress. The remaining 29 OSDs have been holding up fine for many hours.
If needed, I can provide the executable or the objdump output.


 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (()+0x892dd2) [0x7fb5bf2b8dd2]
 2: (()+0xf890) [0x7fb5bb4ae890]
 3: (gsignal()+0x37) [0x7fb5ba270187]
 4: (abort()+0x118) [0x7fb5ba271538]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x7fb5bf43a2f5]
 6: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
 7: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
 8: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
 9: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
 10: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
 11: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
 12: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
 13: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
 14: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
 15: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
 16: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
 17: (()+0x80a4) [0x7fb5bb4a70a4]
 18: (clone()+0x6d) [0x7fb5ba32004d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Thanks,
Nitin



^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Bluestore OSD support in ceph-disk
  2016-09-16 18:38               ` Kamble, Nitin A
@ 2016-09-16 18:43                 ` Somnath Roy
  2016-09-16 19:00                   ` Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2016-09-16 18:43 UTC (permalink / raw)
  To: Kamble, Nitin A, Sage Weil; +Cc: Ceph Development

Please send the snippet (the very first trace; go up in the log) where it actually prints the assert.
BTW, what workload are you running?

Thanks & Regards
Somnath

-----Original Message-----
From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
Sent: Friday, September 16, 2016 11:38 AM
To: Sage Weil
Cc: Somnath Roy; Ceph Development
Subject: Re: Bluestore OSD support in ceph-disk


> On Sep 15, 2016, at 11:43 PM, Kamble, Nitin A <Nitin.Kamble@Teradata.com> wrote:
>
>>
>> On Sep 15, 2016, at 11:54 AM, Sage Weil <sage@newdream.net> wrote:
>>
>>
>> The 128MB figure is mostly pulled out of a hat.  I suspect it will be
>> reasonable, but a proper recommendation is going to depend on how we
>> end up tuning rocksdb, and we've put that off until the metadata
>> format is finalized and any rocksdb tuning we do will be meaningful.
>> We're pretty much at that point now...
>>
>> Whatever it is, it should be related to the request rate, and perhaps
>> the relative speed of the wal device and the db or main device.  The
>> size of the slower devices shouldn't matter, though.
>>
>> There are some bluefs perf counters that let you monitor what the wal
>> device utilization is.  See
>>
>> b.add_u64(l_bluefs_wal_total_bytes, "wal_total_bytes",
>>     "Total bytes (wal device)");
>> b.add_u64(l_bluefs_wal_free_bytes, "wal_free_bytes",
>>     "Free bytes (wal device)");
>>
>> which you can monitor via 'ceph daemon osd.N perf dump'.  If you
>> discover anything interesting, let us know!
>>
>> Thanks-
>> sage
>
> I could build and deploy the latest master (commit: 9096ad37f2c0798c26d7784fb4e7a781feb72cb8) with partitioned bluestore. I struggled a bit to bring up OSDs as the available documentation for bringing up the partitioned bluestore OSDs is mostly primitive so far. Once ceph-disk gets updated this pain will go away. We will stress the cluster shortly, but so far I am delighted to see that from ground-zero it is able stand up on it’s own feet to HEALTH_OK without any errors. If I see any issues in our tests I will share it here.
>
> Thanks,
> Nitin

Out of 30 OSDs, one failed after about 1.5 hours of stress. The remaining 29 OSDs have been holding up fine for many hours.
If needed, I can provide the executable or the objdump output.


 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (()+0x892dd2) [0x7fb5bf2b8dd2]
 2: (()+0xf890) [0x7fb5bb4ae890]
 3: (gsignal()+0x37) [0x7fb5ba270187]
 4: (abort()+0x118) [0x7fb5ba271538]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x7fb5bf43a2f5]
 6: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
 7: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
 8: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
 9: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
 10: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
 11: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
 12: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
 13: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
 14: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
 15: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
 16: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
 17: (()+0x80a4) [0x7fb5bb4a70a4]
 18: (clone()+0x6d) [0x7fb5ba32004d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Thanks,
Nitin



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-16 18:43                 ` Somnath Roy
@ 2016-09-16 19:00                   ` Kamble, Nitin A
  2016-09-16 19:23                     ` Somnath Roy
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-16 19:00 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Sage Weil, Ceph Development


> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
> BTW, what workload you are running ?
> 
> Thanks & Regards
> Somnath
> 
Here it is.

2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto
r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
/build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")

 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
 13: (()+0x80a4) [0x7fb5bb4a70a4]
 14: (clone()+0x6d) [0x7fb5ba32004d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.



Thanks,
Nitin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Bluestore OSD support in ceph-disk
  2016-09-16 19:00                   ` Kamble, Nitin A
@ 2016-09-16 19:23                     ` Somnath Roy
  2016-09-16 20:25                       ` Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2016-09-16 19:23 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Sage Weil, Ceph Development

How have you configured bluestore? All default, i.e. everything in a single partition with no separate partitions for db/wal?
I'm wondering whether you are running out of db space/disk space.
We had some issues on this front a while back which were fixed; this may be a new issue (?). We need a verbose log for at least bluefs (debug_bluefs = 20/20).
BTW, what is your workload (block size, IO pattern)?

-----Original Message-----
From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
Sent: Friday, September 16, 2016 12:00 PM
To: Somnath Roy
Cc: Sage Weil; Ceph Development
Subject: Re: Bluestore OSD support in ceph-disk


> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>
> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
> BTW, what workload you are running ?
>
> Thanks & Regards
> Somnath
>
Here it is.

2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
/build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")

 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
 13: (()+0x80a4) [0x7fb5bb4a70a4]
 14: (clone()+0x6d) [0x7fb5ba32004d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.



Thanks,
Nitin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-16 19:23                     ` Somnath Roy
@ 2016-09-16 20:25                       ` Kamble, Nitin A
  2016-09-16 20:36                         ` Somnath Roy
  2016-09-16 20:54                         ` Sage Weil
  0 siblings, 2 replies; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-16 20:25 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Sage Weil, Ceph Development


> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> How you configured bluestore, all default ? i.e all in single partition , no separate partition for db/wal ?
It is separate partitions for data (SSD), wal (SSD), rocksdb (SSD), and the block store (HDD).

> Wondering if you are out of db space/disk space ?
I notice a misconfiguration on the cluster now: the wal and db partition assignments got swapped, so the db is getting just a 128MB partition. That is probably the cause of the assert.

> We had some issues in this front sometimes back which was fixed, may be a new issue (?). Need verbose log for at least bluefs (debug_bluefs = 20/20)

Let me fix the cluster configuration to give more space to the DB partition. If the issue comes up again after that, I will try capturing detailed logs.

> BTW, what is your workload (block size, IO pattern ) ?

The workload is an internal Teradata benchmark, which simulates the IO pattern of database disk access with various block sizes and IO patterns.

Thanks,
Nitin



> 
> -----Original Message-----
> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
> Sent: Friday, September 16, 2016 12:00 PM
> To: Somnath Roy
> Cc: Sage Weil; Ceph Development
> Subject: Re: Bluestore OSD support in ceph-disk
> 
> 
>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> 
>> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
>> BTW, what workload you are running ?
>> 
>> Thanks & Regards
>> Somnath
>> 
> Here it is.
> 
> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")
> 
> ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
> 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
> 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
> 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
> 13: (()+0x80a4) [0x7fb5bb4a70a4]
> 14: (clone()+0x6d) [0x7fb5ba32004d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> 
> 
> Thanks,
> Nitin
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Bluestore OSD support in ceph-disk
  2016-09-16 20:25                       ` Kamble, Nitin A
@ 2016-09-16 20:36                         ` Somnath Roy
  2016-09-16 20:54                         ` Sage Weil
  1 sibling, 0 replies; 28+ messages in thread
From: Somnath Roy @ 2016-09-16 20:36 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Sage Weil, Ceph Development

Bluestore does share space between the data and db partitions, so as long as there is space in either of the partitions it shouldn't be running out of space.
Anyway, a verbose log should help.
Also, if you are building the code yourself, logging the return value just before the assert would help.

-----Original Message-----
From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com] 
Sent: Friday, September 16, 2016 1:25 PM
To: Somnath Roy
Cc: Sage Weil; Ceph Development
Subject: Re: Bluestore OSD support in ceph-disk


> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> How you configured bluestore, all default ? i.e all in single partition , no separate partition for db/wal ?
It is separate partitions for data (SSD), wal (SSD), rocksdb (SSD), and the block store (HDD).

> Wondering if you are out of db space/disk space ?
I notice a misconfiguration on the cluster now: the wal and db partition assignments got swapped, so the db is getting just a 128MB partition. That is probably the cause of the assert.

> We had some issues in this front sometimes back which was fixed, may be a new issue (?). Need verbose log for at least bluefs (debug_bluefs = 20/20)

Let me fix the cluster configuration to give more space to the DB partition. If the issue comes up again after that, I will try capturing detailed logs.

> BTW, what is your workload (block size, IO pattern ) ?

The workload is an internal Teradata benchmark, which simulates the IO pattern of database disk access with various block sizes and IO patterns.

Thanks,
Nitin



> 
> -----Original Message-----
> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
> Sent: Friday, September 16, 2016 12:00 PM
> To: Somnath Roy
> Cc: Sage Weil; Ceph Development
> Subject: Re: Bluestore OSD support in ceph-disk
> 
> 
>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> 
>> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
>> BTW, what workload you are running ?
>> 
>> Thanks & Regards
>> Somnath
>> 
> Here it is.
> 
> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")
> 
> ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
> 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
> 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
> 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
> 13: (()+0x80a4) [0x7fb5bb4a70a4]
> 14: (clone()+0x6d) [0x7fb5ba32004d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> 
> 
> Thanks,
> Nitin
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-16 20:25                       ` Kamble, Nitin A
  2016-09-16 20:36                         ` Somnath Roy
@ 2016-09-16 20:54                         ` Sage Weil
  2016-09-16 23:09                           ` Kamble, Nitin A
  1 sibling, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-09-16 20:54 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Somnath Roy, Ceph Development

On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
> > On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> > 
> > How you configured bluestore, all default ? i.e all in single partition , no separate partition for db/wal ?
> It is separated partitions for data(SSD), wal(SSD), rocksdb(SSD), & block store (HDD).
> 
> > Wondering if you are out of db space/disk space ?
> I notice a misconfiguration on the cluster now. The wal & db partition use got swapped, so it is getting just 128MB db partition now. Probably this is the cause of the assert.

FWIW bluefs is supposed to fall back on any allocation failure to the next 
larger/slower device (wal -> db -> primary), so having a tiny wal or tiny 
db shouldn't actually matter.  A bluefs log (debug bluefs = 10 or 20) 
leading up to any crash there would be helpful.
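
For example, something along these lines in ceph.conf for the affected osd (restart it afterwards) should capture that log:

  [osd]
          debug bluefs = 20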

Thanks!
sage


> > We had some issues in this front sometimes back which was fixed, may be a new issue (?). Need verbose log for at least bluefs (debug_bluefs = 20/20)
> 
> Let me fix the cluster configuration, to give better space to the DB partition. And if with that this issue comes up then I will try capturing detailed logs.
> 
> > BTW, what is your workload (block size, IO pattern ) ?
> 
> The workload is internal teradata benchmark, which simulates IO pattern of database disk access with various block sizes and IO pattern. 
> 
> Thanks,
> Nitin
> 
> 
> 
> > 
> > -----Original Message-----
> > From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
> > Sent: Friday, September 16, 2016 12:00 PM
> > To: Somnath Roy
> > Cc: Sage Weil; Ceph Development
> > Subject: Re: Bluestore OSD support in ceph-disk
> > 
> > 
> >> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> >> 
> >> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
> >> BTW, what workload you are running ?
> >> 
> >> Thanks & Regards
> >> Somnath
> >> 
> > Here it is.
> > 
> > 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
> > /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")
> > 
> > ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
> > 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
> > 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
> > 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
> > 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
> > 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
> > 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
> > 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
> > 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
> > 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
> > 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
> > 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
> > 13: (()+0x80a4) [0x7fb5bb4a70a4]
> > 14: (clone()+0x6d) [0x7fb5ba32004d]
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > 
> > 
> > 
> > Thanks,
> > Nitin
> > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-16 20:54                         ` Sage Weil
@ 2016-09-16 23:09                           ` Kamble, Nitin A
  2016-09-16 23:15                             ` Somnath Roy
  2016-09-17 14:14                             ` Sage Weil
  0 siblings, 2 replies; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-16 23:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Ceph Development


> On Sep 16, 2016, at 1:54 PM, Sage Weil <sage@newdream.net> wrote:
> 
> On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
>>> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>> 
>>> How you configured bluestore, all default ? i.e all in single partition , no separate partition for db/wal ?
>> It is separated partitions for data(SSD), wal(SSD), rocksdb(SSD), & block store (HDD).
>> 
>>> Wondering if you are out of db space/disk space ?
>> I notice a misconfiguration on the cluster now. The wal & db partition use got swapped, so it is getting just 128MB db partition now. Probably this is the cause of the assert.
> 
> FWIW bluefs is supposed to fall back on any allocation failure to the next 
> larger/slower device (wal -> db -> primary), so having a tiny wal or tiny 
> db shouldn't actually matter.  A bluefs log (debug bluefs = 10 or 20) 
> leading up to any crash there would be helpful.
> 
> Thanks!
> sage
> 

Good to know about this fallback mechanism.
In my previous run the partitions and the sizes in the config did not match. I saw the ceph daemon
perf dump showing 900MB+ used for the db while the db partition was only 128MB. I thought it had
started overwriting the next partition, but as per this fallback logic it instead started using the HDD for the db.

One issue I see is that ceph.conf lists the sizes of the wal, db, and block devices, but the actual
partitions may have different sizes. From the ceph daemon dump output it looks like the code is not
looking at the partition's real size; instead it assumes the sizes from the config file are the partition
sizes. I think probing the size of the existing devices/files would be better than blindly taking the
sizes from the config file.
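
For example, probing the real size of a partition is cheap (the device name is only an illustration):

  blockdev --getsize64 /dev/sdc2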

After 5 hours or so, 6+ OSDs out of 30 were down.
We will run the stress test once again with the fixed partition configuration at debug level 0, to get
maximum performance. If that fails, I will switch to debug level 10 or 20 and gather some detailed
logs.

Thanks,
Nitin

> 
>>> We had some issues in this front sometimes back which was fixed, may be a new issue (?). Need verbose log for at least bluefs (debug_bluefs = 20/20)
>> 
>> Let me fix the cluster configuration, to give better space to the DB partition. And if with that this issue comes up then I will try capturing detailed logs.
>> 
>>> BTW, what is your workload (block size, IO pattern ) ?
>> 
>> The workload is internal teradata benchmark, which simulates IO pattern of database disk access with various block sizes and IO pattern. 
>> 
>> Thanks,
>> Nitin
>> 
>> 
>> 
>>> 
>>> -----Original Message-----
>>> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
>>> Sent: Friday, September 16, 2016 12:00 PM
>>> To: Somnath Roy
>>> Cc: Sage Weil; Ceph Development
>>> Subject: Re: Bluestore OSD support in ceph-disk
>>> 
>>> 
>>>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>> 
>>>> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
>>>> BTW, what workload you are running ?
>>>> 
>>>> Thanks & Regards
>>>> Somnath
>>>> 
>>> Here it is.
>>> 
>>> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")
>>> 
>>> ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
>>> 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
>>> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
>>> 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
>>> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
>>> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
>>> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
>>> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
>>> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
>>> 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
>>> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
>>> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
>>> 13: (()+0x80a4) [0x7fb5bb4a70a4]
>>> 14: (clone()+0x6d) [0x7fb5ba32004d]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>> 
>>> 
>>> 
>>> Thanks,
>>> Nitin
>>> 
>> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Bluestore OSD support in ceph-disk
  2016-09-16 23:09                           ` Kamble, Nitin A
@ 2016-09-16 23:15                             ` Somnath Roy
  2016-09-18  6:41                               ` Kamble, Nitin A
  2016-09-17 14:14                             ` Sage Weil
  1 sibling, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2016-09-16 23:15 UTC (permalink / raw)
  To: Kamble, Nitin A, Sage Weil; +Cc: Ceph Development

Specifying the partition sizes in ceph.conf is not mandatory.
If you only specify the db/wal/data partitions, it should use the entire size of each partition. If you also specify sizes on top of that, it will truncate to those sizes. Here is one sample config:

[osd.0]
       host = emsnode10
       devs = /dev/sdb1
       bluestore_block_db_path = /dev/sdb2
       bluestore_block_wal_path = /dev/sdb3
       bluestore_block_path = /dev/sdb4
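
If you do want to cap the sizes on top of that, the size options would look something like this (option names from memory; the values are just an example):

       bluestore_block_db_size = 1073741824
       bluestore_block_wal_size = 134217728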

I guess devs is not required for ceph-disk; I am using the old mkcephfs tooling, which is why it is needed here.

Thanks & Regards
Somnath

-----Original Message-----
From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com] 
Sent: Friday, September 16, 2016 4:09 PM
To: Sage Weil
Cc: Somnath Roy; Ceph Development
Subject: Re: Bluestore OSD support in ceph-disk


> On Sep 16, 2016, at 1:54 PM, Sage Weil <sage@newdream.net> wrote:
> 
> On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
>>> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>> 
>>> How you configured bluestore, all default ? i.e all in single partition , no separate partition for db/wal ?
>> It is separated partitions for data(SSD), wal(SSD), rocksdb(SSD), & block store (HDD).
>> 
>>> Wondering if you are out of db space/disk space ?
>> I notice a misconfiguration on the cluster now. The wal & db partition use got swapped, so it is getting just 128MB db partition now. Probably this is the cause of the assert.
> 
> FWIW bluefs is supposed to fall back on any allocation failure to the 
> next larger/slower device (wal -> db -> primary), so having a tiny wal 
> or tiny db shouldn't actually matter.  A bluefs log (debug bluefs = 10 
> or 20) leading up to any crash there would be helpful.
> 
> Thanks!
> sage
> 

Good to know about this fallback mechanism.
In my previous run the partitions and the sizes in the config did not match. I saw the ceph daemon perf dump showing 900MB+ used for the db while the db partition was only 128MB. I thought it had started overwriting the next partition, but as per this fallback logic it instead started using the HDD for the db.

One issue I see is that ceph.conf lists the sizes of the wal, db, and block devices, but the actual partitions may have different sizes. From the ceph daemon dump output it looks like the code is not looking at the partition's real size; instead it assumes the sizes from the config file are the partition sizes. I think probing the size of the existing devices/files would be better than blindly taking the sizes from the config file.

After 5 hours or so, 6+ OSDs out of 30 were down.
We will run the stress test once again with the fixed partition configuration at debug level 0, to get maximum performance. If that fails, I will switch to debug level 10 or 20 and gather some detailed logs.

Thanks,
Nitin

> 
>>> We had some issues in this front sometimes back which was fixed, may 
>>> be a new issue (?). Need verbose log for at least bluefs 
>>> (debug_bluefs = 20/20)
>> 
>> Let me fix the cluster configuration, to give better space to the DB partition. And if with that this issue comes up then I will try capturing detailed logs.
>> 
>>> BTW, what is your workload (block size, IO pattern ) ?
>> 
>> The workload is internal teradata benchmark, which simulates IO pattern of database disk access with various block sizes and IO pattern. 
>> 
>> Thanks,
>> Nitin
>> 
>> 
>> 
>>> 
>>> -----Original Message-----
>>> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
>>> Sent: Friday, September 16, 2016 12:00 PM
>>> To: Somnath Roy
>>> Cc: Sage Weil; Ceph Development
>>> Subject: Re: Bluestore OSD support in ceph-disk
>>> 
>>> 
>>>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>> 
>>>> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
>>>> BTW, what workload you are running ?
>>>> 
>>>> Thanks & Regards
>>>> Somnath
>>>> 
>>> Here it is.
>>> 
>>> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 
>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild
>>> /BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In 
>>> function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto 
>>> r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 
>>> 08:49:30.602139
>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild
>>> /BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: 
>>> FAILED assert(0 == "allocate failed... wtf")
>>> 
>>> ceph version v11.0.0-2309-g9096ad3 
>>> (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>> const*)+0x8b) [0x7fb5bf43a11b]
>>> 2: (BlueFS::_allocate(unsigned char, unsigned long, 
>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> 
>>> >*)+0x8ad) [0x7fb5bf2735dd]
>>> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, 
>>> unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
>>> 4: (BlueFS::_fsync(BlueFS::FileWriter*, 
>>> std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
>>> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
>>> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) 
>>> [0x7fb5bf388699]
>>> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
>>> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
>>> unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
>>> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
>>> rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
>>> 10: 
>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::T
>>> ransactionImpl>)+0x5b) [0x7fb5bf21a14b]
>>> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
>>> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
>>> 13: (()+0x80a4) [0x7fb5bb4a70a4]
>>> 14: (clone()+0x6d) [0x7fb5ba32004d]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>> 
>>> 
>>> 
>>> Thanks,
>>> Nitin
>>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-16 23:09                           ` Kamble, Nitin A
  2016-09-16 23:15                             ` Somnath Roy
@ 2016-09-17 14:14                             ` Sage Weil
  2016-09-18  6:41                               ` Kamble, Nitin A
  1 sibling, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-09-17 14:14 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Somnath Roy, Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6751 bytes --]

On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
> > On Sep 16, 2016, at 1:54 PM, Sage Weil <sage@newdream.net> wrote:
> > 
> > On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
> >>> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> >>> 
> >>> How you configured bluestore, all default ? i.e all in single partition , no separate partition for db/wal ?
> >> It is separated partitions for data(SSD), wal(SSD), rocksdb(SSD), & block store (HDD).
> >> 
> >>> Wondering if you are out of db space/disk space ?
> >> I notice a misconfiguration on the cluster now. The wal & db partition use got swapped, so it is getting just 128MB db partition now. Probably this is the cause of the assert.
> > 
> > FWIW bluefs is supposed to fall back on any allocation failure to the next 
> > larger/slower device (wal -> db -> primary), so having a tiny wal or tiny 
> > db shouldn't actually matter.  A bluefs log (debug bluefs = 10 or 20) 
> > leading up to any crash there would be helpful.
> > 
> > Thanks!
> > sage
> > 
> 
> Good to know this fall back mechanism.
> In my previous run the partitions and sizes in config did not match. I see the ceph-daemon-dump
> showing 900MB+ used for db while db partition was 128MB. I was thinking it started overwriting on
> to the next partition. But instead as per this backup logic it started using HDD for db.
> 
> One issue I see is that, ceph.conf lists the sizes of the wal,db,& block. But it is possible that actual
> partitions may have different sizes. From the ceph-daemon-dump output looks like it is not looking
> at the partition’s real size, instead the code is assuming the sizes from the config file as the partition
> sizes. I think probing of the size of existing devices/files will be better than taking the sizes from the
> config file blindly.

Those bluestore_block*_size options are only used in conjunction with 
bluestore_block*_create = true to create a dummy block device file for 
debugging/testing purposes.
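
A minimal sketch of that testing setup, with a made-up path and size (not taken from this thread), would look something like:

[osd.0]
       # the *_size only matters because *_create = true here;
       # bluestore then creates a file of that size to act as the device
       bluestore_block_create = true
       bluestore_block_path = /var/lib/ceph/osd/ceph-0/block
       bluestore_block_size = 10737418240    # 10 GB test file
       # for a real partition, point *_path at the device and the
       # partition's actual size is what gets used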

Are you referring to the perf counters (ceph daemon osd.N perf dump) 
figures?  Can you share the numbers you saw?  It sounds like there might 
be a bug in the instrumentation.
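
For reference, those figures can be pulled from the admin socket along these lines (osd.0 and the python pretty-printing are just examples):

       ceph daemon osd.0 perf dump | python -m json.tool | less
       # the bluefs section of that dump should show the per-device
       # totals and used bytes for db/wal/slow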

> After 5 hours or so 6+ osds were down out of 30.
> We will be running the stress test one again with the fixed partition configuration with debug level of
> 0, to get max performance out. And if that fails then I will switch to debug level 10 or 20, and gather
> some detailed logs.

Any backtraces you see would be helpful.

Thanks!
sage


> 
> Thanks,
> Nitin
> 
> > 
> >>> We had some issues in this front sometimes back which was fixed, may be a new issue (?). Need verbose log for at least bluefs (debug_bluefs = 20/20)
> >> 
> >> Let me fix the cluster configuration, to give better space to the DB partition. And if with that this issue comes up then I will try capturing detailed logs.
> >> 
> >>> BTW, what is your workload (block size, IO pattern ) ?
> >> 
> >> The workload is internal teradata benchmark, which simulates IO pattern of database disk access with various block sizes and IO pattern. 
> >> 
> >> Thanks,
> >> Nitin
> >> 
> >> 
> >> 
> >>> 
> >>> -----Original Message-----
> >>> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
> >>> Sent: Friday, September 16, 2016 12:00 PM
> >>> To: Somnath Roy
> >>> Cc: Sage Weil; Ceph Development
> >>> Subject: Re: Bluestore OSD support in ceph-disk
> >>> 
> >>> 
> >>>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> >>>> 
> >>>> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
> >>>> BTW, what workload you are running ?
> >>>> 
> >>>> Thanks & Regards
> >>>> Somnath
> >>>> 
> >>> Here it is.
> >>> 
> >>> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
> >>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")
> >>> 
> >>> ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
> >>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
> >>> 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
> >>> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
> >>> 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
> >>> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
> >>> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
> >>> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
> >>> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
> >>> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
> >>> 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
> >>> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
> >>> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
> >>> 13: (()+0x80a4) [0x7fb5bb4a70a4]
> >>> 14: (clone()+0x6d) [0x7fb5ba32004d]
> >>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>> 
> >>> 
> >>> 
> >>> Thanks,
> >>> Nitin
> >>> 
> >> 
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-16 23:15                             ` Somnath Roy
@ 2016-09-18  6:41                               ` Kamble, Nitin A
  2016-09-18  7:06                                 ` Varada Kari
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-18  6:41 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Sage Weil, Ceph Development


> On Sep 16, 2016, at 4:15 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> The sizes of the partitions from ceph.conf is not mandatory. 
> If you mention , the partition of db/wal/data , it should be using the entire size of the partition. If you mention the sizes on top , it will truncate to that size..Here is one sample config..
> 
> [osd.0]
>       host = emsnode10
>       devs = /dev/sdb1
>       bluestore_block_db_path = /dev/sdb2
>       bluestore_block_wal_path = /dev/sdb3
>       bluestore_block_path = /dev/sdb4
> 
> I guess devs is not required for ceph-disk. I am using old mkcephfs stuff and that's why it is needed.
> 
> Thanks & Regards
> Somnath
> 

Instead of using the config for the OSD definition, I am creating the OSDs manually using a script in which all
the wal, db, and block links are created. Even if one does not specify the partition sizes, there are
predefined defaults. Is there a way to ignore the size from the config and use the size obtained by probing the device?
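
Roughly, the script does the equivalent of the following (the device paths and OSD id here are only illustrative):

       # sketch of the manual bluestore OSD setup (paths/ids are examples)
       OSD_ID=0
       OSD_DIR=/var/lib/ceph/osd/ceph-$OSD_ID
       mkdir -p $OSD_DIR                     # small data partition (SSD) gets mounted here
       ln -s /dev/sdk4 $OSD_DIR/block        # block store (HDD)
       ln -s /dev/sdk2 $OSD_DIR/block.db     # rocksdb partition (SSD)
       ln -s /dev/sdk3 $OSD_DIR/block.wal    # wal partition (SSD)
       ceph-osd -i $OSD_ID --mkfs --mkkey --osd-objectstore bluestore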

Thanks,
Nitin

> -----Original Message-----
> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com] 
> Sent: Friday, September 16, 2016 4:09 PM
> To: Sage Weil
> Cc: Somnath Roy; Ceph Development
> Subject: Re: Bluestore OSD support in ceph-disk
> 
> 
>> On Sep 16, 2016, at 1:54 PM, Sage Weil <sage@newdream.net> wrote:
>> 
>> On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
>>>> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>> 
>>>> How you configured bluestore, all default ? i.e all in single partition , no separate partition for db/wal ?
>>> It is separated partitions for data(SSD), wal(SSD), rocksdb(SSD), & block store (HDD).
>>> 
>>>> Wondering if you are out of db space/disk space ?
>>> I notice a misconfiguration on the cluster now. The wal & db partition use got swapped, so it is getting just 128MB db partition now. Probably this is the cause of the assert.
>> 
>> FWIW bluefs is supposed to fall back on any allocation failure to the 
>> next larger/slower device (wal -> db -> primary), so having a tiny wal 
>> or tiny db shouldn't actually matter.  A bluefs log (debug bluefs = 10 
>> or 20) leading up to any crash there would be helpful.
>> 
>> Thanks!
>> sage
>> 
> 
> Good to know this fall back mechanism.
> In my previous run the partitions and sizes in config did not match. I see the ceph-daemon-dump showing 900MB+ used for db while db partition was 128MB. I was thinking it started overwriting on to the next partition. But instead as per this backup logic it started using HDD for db.
> 
> One issue I see is that, ceph.conf lists the sizes of the wal,db,& block. But it is possible that actual partitions may have different sizes. From the ceph-daemon-dump output looks like it is not looking at the partition’s real size, instead the code is assuming the sizes from the config file as the partition sizes. I think probing of the size of existing devices/files will be better than taking the sizes from the config file blindly.
> 
> After 5 hours or so 6+ osds were down out of 30.
> We will be running the stress test one again with the fixed partition configuration with debug level of 0, to get max performance out. And if that fails then I will switch to debug level 10 or 20, and gather some detailed logs.
> 
> Thanks,
> Nitin
> 
>> 
>>>> We had some issues in this front sometimes back which was fixed, may 
>>>> be a new issue (?). Need verbose log for at least bluefs 
>>>> (debug_bluefs = 20/20)
>>> 
>>> Let me fix the cluster configuration, to give better space to the DB partition. And if with that this issue comes up then I will try capturing detailed logs.
>>> 
>>>> BTW, what is your workload (block size, IO pattern ) ?
>>> 
>>> The workload is internal teradata benchmark, which simulates IO pattern of database disk access with various block sizes and IO pattern. 
>>> 
>>> Thanks,
>>> Nitin
>>> 
>>> 
>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
>>>> Sent: Friday, September 16, 2016 12:00 PM
>>>> To: Somnath Roy
>>>> Cc: Sage Weil; Ceph Development
>>>> Subject: Re: Bluestore OSD support in ceph-disk
>>>> 
>>>> 
>>>>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>>> 
>>>>> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
>>>>> BTW, what workload you are running ?
>>>>> 
>>>>> Thanks & Regards
>>>>> Somnath
>>>>> 
>>>> Here it is.
>>>> 
>>>> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 
>>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild
>>>> /BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In 
>>>> function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto 
>>>> r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 
>>>> 08:49:30.602139
>>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild
>>>> /BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: 
>>>> FAILED assert(0 == "allocate failed... wtf")
>>>> 
>>>> ceph version v11.0.0-2309-g9096ad3 
>>>> (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>>> const*)+0x8b) [0x7fb5bf43a11b]
>>>> 2: (BlueFS::_allocate(unsigned char, unsigned long, 
>>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> 
>>>>> *)+0x8ad) [0x7fb5bf2735dd]
>>>> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, 
>>>> unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
>>>> 4: (BlueFS::_fsync(BlueFS::FileWriter*, 
>>>> std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
>>>> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
>>>> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) 
>>>> [0x7fb5bf388699]
>>>> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
>>>> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
>>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
>>>> unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
>>>> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
>>>> rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
>>>> 10: 
>>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::T
>>>> ransactionImpl>)+0x5b) [0x7fb5bf21a14b]
>>>> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
>>>> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
>>>> 13: (()+0x80a4) [0x7fb5bb4a70a4]
>>>> 14: (clone()+0x6d) [0x7fb5ba32004d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Nitin
>>>> 
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-17 14:14                             ` Sage Weil
@ 2016-09-18  6:41                               ` Kamble, Nitin A
  0 siblings, 0 replies; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-18  6:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Ceph Development


> On Sep 17, 2016, at 7:14 AM, Sage Weil <sage@newdream.net> wrote:
> 
> On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
>>> On Sep 16, 2016, at 1:54 PM, Sage Weil <sage@newdream.net> wrote:
>>> 
>>> On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
>>>>> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>>> 
>>>>> How you configured bluestore, all default ? i.e all in single partition , no separate partition for db/wal ?
>>>> It is separated partitions for data(SSD), wal(SSD), rocksdb(SSD), & block store (HDD).
>>>> 
>>>>> Wondering if you are out of db space/disk space ?
>>>> I notice a misconfiguration on the cluster now. The wal & db partition use got swapped, so it is getting just 128MB db partition now. Probably this is the cause of the assert.
>>> 
>>> FWIW bluefs is supposed to fall back on any allocation failure to the next 
>>> larger/slower device (wal -> db -> primary), so having a tiny wal or tiny 
>>> db shouldn't actually matter.  A bluefs log (debug bluefs = 10 or 20) 
>>> leading up to any crash there would be helpful.
>>> 
>>> Thanks!
>>> sage
>>> 
>> 
>> Good to know this fall back mechanism.
>> In my previous run the partitions and sizes in config did not match. I see the ceph-daemon-dump
>> showing 900MB+ used for db while db partition was 128MB. I was thinking it started overwriting on
>> to the next partition. But instead as per this backup logic it started using HDD for db.
>> 
>> One issue I see is that, ceph.conf lists the sizes of the wal,db,& block. But it is possible that actual
>> partitions may have different sizes. From the ceph-daemon-dump output looks like it is not looking
>> at the partition’s real size, instead the code is assuming the sizes from the config file as the partition
>> sizes. I think probing of the size of existing devices/files will be better than taking the sizes from the
>> config file blindly.
> 
> Those bluestore_block*_size options are only used in conjunction with 
> bluestore_block*_create = true to create a dummy block device file for 
> debugging/testing purposes.

So if I set these config options to false in ceph.conf, then the default sizes should get ignored, right?

> 
> Are you referring to the perf counters (ceph daemon osd.N perf dump) 
> figures?  Can you share the numbers you saw?  It sounds like there might 
> be a bug in the instrumentation.

Yes, I was referring to the perf counters you mentioned at the beginning of this thread.
Unfortunately I am resurrecting the cluster frequently to get some performance data out, so
I can't show you the numbers right now. I will try to capture them next time.
In the 1st attempt, where I messed up the partition links, I noticed that the perf stat
for the db was showing the size from the config file, which was much bigger (37GB) than the partition's
actual size (128MB). It is possible I made a mistake in identifying the partitions correctly,
as I was focusing on getting the setup working more than on looking for possible bugs in the implementation.
But it will be easy and quick to just try it out and verify which partition size is shown in the perf stat.

BTW, in another run a few OSDs died with the following errors. This time the partition layout was not messed up.

osd.2 

2016-09-17 02:36:18.460357 7f1fd38c9700 -1 *** Caught signal (Aborted) **
 in thread 7f1fd38c9700 thread_name:tp_osd_tp

 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (()+0x892dd2) [0x7f1ff24d9dd2]
 2: (()+0xf890) [0x7f1fee6cf890]
 3: (pthread_cond_wait()+0xbf) [0x7f1fee6cc05f]
 4: (Throttle::_wait(long)+0x1a4) [0x7f1ff2651994]
 5: (Throttle::get(long, long)+0xc7) [0x7f1ff26525f7]
 6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, Thread
Pool::TPHandle*)+0x8a3) [0x7f1ff2407ad3]
 7: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x7c) [0x7f1ff21ec93c]
 8: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xb8f) [0x7f1ff22bcf0f]
 9: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x3b3) [0x7f1ff22bd883]
 10: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xfa) [0x7f1ff218c8da]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x7f1ff2045905]
 12: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x5d) [0x7f1ff2045b2d]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x874) [0x7f1ff2066e84]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x947) [0x7f1ff2660a37]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f1ff2662b90]
 16: (()+0x80a4) [0x7f1fee6c80a4]
 17: (clone()+0x6d) [0x7f1fed54104d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Osd.3 


2016-09-16 23:10:39.690923 7f1bb9c6f700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In f
unction 'int BlueFS::_allocate(uint8_t, uint64_t, std::vector<bluefs_extent_t>*)' thread 7f1bb9c6f700 time 2016-09-16 23:10:39.686574
/build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wt
f")

 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f1be2a1511b]
 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7f1be284e5dd]
 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7f1be2855a1f]
 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7f1be2856c9b]
 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7f1be286c25e]
 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7f1be2963699]
 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7f1be2964238]
 8: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*)+0x8c3) [0x7f1be29a6fa3]
 9: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0xb19) [0x7f1be29a8359]
 10: (rocksdb::CompactionJob::Run()+0x1cf) [0x7f1be29a990f]
 11: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0xa0c) [0x7f1be28aaf4c]
 12: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0x27d) [0x7f1be28ba18d]
 13: (rocksdb::ThreadPoolImpl::BGThread(unsigned long)+0x1a1) [0x7f1be296b8a1]
 14: (()+0x96a983) [0x7f1be296b983]
 15: (()+0x80a4) [0x7f1bdea820a4]
 16: (clone()+0x6d) [0x7f1bdd8fb04d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Also, I noticed 3 more OSDs on different nodes were in the down state without any errors in their logs.
The ceph-osd processes for these OSDs were running and consuming 100% CPU, while the other
well-behaving OSDs were in the up state and using 1 to 3% CPU.
I checked the perf stat information and noticed that all 3 had the WAL (128MB) fully consumed, with 0 bytes free.
I have resurrected the setup since then, and I am currently focusing on getting some
performance numbers for the latest bluestore within a constrained time limit. Once I am done with that,
I will help out more with gathering more detailed data, and with some debugging efforts as well.

Thanks,
Nitin


> 
>> After 5 hours or so 6+ osds were down out of 30.
>> We will be running the stress test one again with the fixed partition configuration with debug level of
>> 0, to get max performance out. And if that fails then I will switch to debug level 10 or 20, and gather
>> some detailed logs.
> 
> Any backtraces you see would be helpful.
> 
> Thanks!
> sage
> 
> 
>> 
>> Thanks,
>> Nitin
>> 
>>> 
>>>>> We had some issues in this front sometimes back which was fixed, may be a new issue (?). Need verbose log for at least bluefs (debug_bluefs = 20/20)
>>>> 
>>>> Let me fix the cluster configuration, to give better space to the DB partition. And if with that this issue comes up then I will try capturing detailed logs.
>>>> 
>>>>> BTW, what is your workload (block size, IO pattern ) ?
>>>> 
>>>> The workload is internal teradata benchmark, which simulates IO pattern of database disk access with various block sizes and IO pattern. 
>>>> 
>>>> Thanks,
>>>> Nitin
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
>>>>> Sent: Friday, September 16, 2016 12:00 PM
>>>>> To: Somnath Roy
>>>>> Cc: Sage Weil; Ceph Development
>>>>> Subject: Re: Bluestore OSD support in ceph-disk
>>>>> 
>>>>> 
>>>>>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>>>> 
>>>>>> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
>>>>>> BTW, what workload you are running ?
>>>>>> 
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>> 
>>>>> Here it is.
>>>>> 
>>>>> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
>>>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")
>>>>> 
>>>>> ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
>>>>> 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
>>>>> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
>>>>> 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
>>>>> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
>>>>> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
>>>>> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
>>>>> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
>>>>> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
>>>>> 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
>>>>> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
>>>>> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
>>>>> 13: (()+0x80a4) [0x7fb5bb4a70a4]
>>>>> 14: (clone()+0x6d) [0x7fb5ba32004d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Nitin
>>>>> 
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-18  6:41                               ` Kamble, Nitin A
@ 2016-09-18  7:06                                 ` Varada Kari
  2016-09-19  0:28                                   ` Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Varada Kari @ 2016-09-18  7:06 UTC (permalink / raw)
  To: Kamble, Nitin A, Somnath Roy; +Cc: Sage Weil, Ceph Development



On Sunday 18 September 2016 12:12 PM, Kamble, Nitin A wrote:
>> On Sep 16, 2016, at 4:15 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>
>> The sizes of the partitions from ceph.conf is not mandatory. 
>> If you mention , the partition of db/wal/data , it should be using the entire size of the partition. If you mention the sizes on top , it will truncate to that size..Here is one sample config..
>>
>> [osd.0]
>>       host = emsnode10
>>       devs = /dev/sdb1
>>       bluestore_block_db_path = /dev/sdb2
>>       bluestore_block_wal_path = /dev/sdb3
>>       bluestore_block_path = /dev/sdb4
>>
>> I guess devs is not required for ceph-disk. I am using old mkcephfs stuff and that's why it is needed.
>>
>> Thanks & Regards
>> Somnath
>>
> Instead of using the config for osd definition I am creating the OSDs manually using a script in which all 
> the wal,db,block links are created. Even if one does not specify the partitions sizes, there are defaults
> predefined. Is there a way to ignore the size from the config and use the size by probing the device?
If the size is not specified in the config (ceph.conf), by default the
drive/partition is probed and its whole capacity is used.
Varada
>
> Thanks,
> Nitin
>
>> -----Original Message-----
>> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com] 
>> Sent: Friday, September 16, 2016 4:09 PM
>> To: Sage Weil
>> Cc: Somnath Roy; Ceph Development
>> Subject: Re: Bluestore OSD support in ceph-disk
>>
>>
>>> On Sep 16, 2016, at 1:54 PM, Sage Weil <sage@newdream.net> wrote:
>>>
>>> On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
>>>>> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>>>
>>>>> How you configured bluestore, all default ? i.e all in single partition , no separate partition for db/wal ?
>>>> It is separated partitions for data(SSD), wal(SSD), rocksdb(SSD), & block store (HDD).
>>>>
>>>>> Wondering if you are out of db space/disk space ?
>>>> I notice a misconfiguration on the cluster now. The wal & db partition use got swapped, so it is getting just 128MB db partition now. Probably this is the cause of the assert.
>>> FWIW bluefs is supposed to fall back on any allocation failure to the 
>>> next larger/slower device (wal -> db -> primary), so having a tiny wal 
>>> or tiny db shouldn't actually matter.  A bluefs log (debug bluefs = 10 
>>> or 20) leading up to any crash there would be helpful.
>>>
>>> Thanks!
>>> sage
>>>
>> Good to know this fall back mechanism.
>> In my previous run the partitions and sizes in config did not match. I see the ceph-daemon-dump showing 900MB+ used for db while db partition was 128MB. I was thinking it started overwriting on to the next partition. But instead as per this backup logic it started using HDD for db.
>>
>> One issue I see is that, ceph.conf lists the sizes of the wal,db,& block. But it is possible that actual partitions may have different sizes. From the ceph-daemon-dump output looks like it is not looking at the partition’s real size, instead the code is assuming the sizes from the config file as the partition sizes. I think probing of the size of existing devices/files will be better than taking the sizes from the config file blindly.
>>
>> After 5 hours or so 6+ osds were down out of 30.
>> We will be running the stress test one again with the fixed partition configuration with debug level of 0, to get max performance out. And if that fails then I will switch to debug level 10 or 20, and gather some detailed logs.
>>
>> Thanks,
>> Nitin
>>
>>>>> We had some issues in this front sometimes back which was fixed, may 
>>>>> be a new issue (?). Need verbose log for at least bluefs 
>>>>> (debug_bluefs = 20/20)
>>>> Let me fix the cluster configuration, to give better space to the DB partition. And if with that this issue comes up then I will try capturing detailed logs.
>>>>
>>>>> BTW, what is your workload (block size, IO pattern ) ?
>>>> The workload is internal teradata benchmark, which simulates IO pattern of database disk access with various block sizes and IO pattern. 
>>>>
>>>> Thanks,
>>>> Nitin
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com]
>>>>> Sent: Friday, September 16, 2016 12:00 PM
>>>>> To: Somnath Roy
>>>>> Cc: Sage Weil; Ceph Development
>>>>> Subject: Re: Bluestore OSD support in ceph-disk
>>>>>
>>>>>
>>>>>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>>>>>
>>>>>> Please send the snippet (very first trace , go up in the log) where it is actually printing the assert.
>>>>>> BTW, what workload you are running ?
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>>
>>>>> Here it is.
>>>>>
>>>>> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 
>>>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild
>>>>> /BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In 
>>>>> function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vecto 
>>>>> r<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 
>>>>> 08:49:30.602139
>>>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild
>>>>> /BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: 
>>>>> FAILED assert(0 == "allocate failed... wtf")
>>>>>
>>>>> ceph version v11.0.0-2309-g9096ad3 
>>>>> (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>>>> const*)+0x8b) [0x7fb5bf43a11b]
>>>>> 2: (BlueFS::_allocate(unsigned char, unsigned long, 
>>>>> std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> 
>>>>>> *)+0x8ad) [0x7fb5bf2735dd]
>>>>> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, 
>>>>> unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
>>>>> 4: (BlueFS::_fsync(BlueFS::FileWriter*, 
>>>>> std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
>>>>> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
>>>>> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) 
>>>>> [0x7fb5bf388699]
>>>>> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
>>>>> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
>>>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
>>>>> unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
>>>>> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
>>>>> rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
>>>>> 10: 
>>>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::T
>>>>> ransactionImpl>)+0x5b) [0x7fb5bf21a14b]
>>>>> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
>>>>> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
>>>>> 13: (()+0x80a4) [0x7fb5bb4a70a4]
>>>>> 14: (clone()+0x6d) [0x7fb5ba32004d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Nitin
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>>> info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-18  7:06                                 ` Varada Kari
@ 2016-09-19  0:28                                   ` Kamble, Nitin A
  2016-09-19  1:58                                     ` Varada Kari
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-19  0:28 UTC (permalink / raw)
  To: Varada Kari; +Cc: Somnath Roy, Sage Weil, Ceph Development

I find that the ceph-osd processes which are consuming 100% CPU all have the same last line in their logs.

It means that log rotation was triggered, and it is taking forever to finish.
host5:~ # ls -lh /var/log/ceph/ceph-osd.24*
-rw-r----- 1 ceph ceph    0 Sep 18 17:00 /var/log/ceph/ceph-osd.24.log
-rw-r----- 1 ceph ceph 1.4G Sep 18 17:00 /var/log/ceph/ceph-osd.24.log-20160918

host5:~ # tail /var/log/ceph/ceph-osd.24.log-20160918
2016-09-18 11:36:18.292275 7fab858dc700 10 bluefs get_usage bdev 2 free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
2016-09-18 11:36:18.292279 7fab858dc700 10 bluefs _flush 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
2016-09-18 11:36:18.292280 7fab858dc700 10 bluefs _flush 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
2016-09-18 11:36:18.292281 7fab858dc700 10 bluefs _fsync 0x7fac47a5dd00 file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000+200000])
2016-09-18 11:36:18.292286 7fab858dc700 10 bluefs _flush 0x7fac47a5dd00 0x1b10000~cee to file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000+200000])
2016-09-18 11:36:18.292289 7fab858dc700 10 bluefs _flush_range 0x7fac47a5dd00 pos 0x1b10000 0x1b10000~cee to file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000+200000])
2016-09-18 11:36:18.292292 7fab858dc700 20 bluefs _flush_range file now file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000+200000])
2016-09-18 11:36:18.292296 7fab858dc700 20 bluefs _flush_range in 1:0x10000000+2100000 x_off 0x10000
2016-09-18 11:36:18.292297 7fab858dc700 20 bluefs _flush_range caching tail of 0xcee and padding block with zeros
2016-09-18 17:00:01.276990 7fab738b8700 -1 received  signal: Hangup from  PID: 89063 task name: killall -q -1 ceph-mon ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
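
To see where these spinning processes are stuck, I could grab thread backtraces from one of them, e.g. along these lines (picking the first ceph-osd pid is just for illustration):

       # dump backtraces of all threads of one spinning ceph-osd
       gdb -batch -p $(pidof ceph-osd | awk '{print $1}') -ex "thread apply all bt"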

Further one of the osd process has crashed with this in the log:

2016-09-18 13:30:11.274012 7fdf399b8700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7fdf399b8700 time 2016-09-18 13:30:11.270019
/build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370: FAILED assert(r == 0)

 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fdf4f73811b]
 2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0xcbd) [0x7fdf4f575b6d]
 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
 5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
 6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fdf4f686699]
 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
 11: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fdf4f51814b]
 12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
 13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
 14: (()+0x80a4) [0x7fdf4b7a50a4]
 15: (clone()+0x6d) [0x7fdf4a61e04d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This time I have captured the log with debug bluefs = 20/20.

Is there a good place where I can upload the tail of the log for sharing?

Thanks,
Nitin




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-19  0:28                                   ` Kamble, Nitin A
@ 2016-09-19  1:58                                     ` Varada Kari
  2016-09-19  3:25                                       ` Somnath Roy
  0 siblings, 1 reply; 28+ messages in thread
From: Varada Kari @ 2016-09-19  1:58 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Somnath Roy, Sage Weil, Ceph Development

If you are not running the latest master, could you please retry with the
latest master? https://github.com/ceph/ceph/pull/11095 should solve the
problem.

If you are still hitting the problem with the latest master, please post the
logs in a shared location like Google Drive or pastebin.

Varada

On Monday 19 September 2016 05:58 AM, Kamble, Nitin A wrote:
> I find the ceph-osd processes which are taking 100% cpu are all have common log last line.
>
>
> It means the log rotation has triggered, and it takes forever to finish.
> host5:~ # ls -lh /var/log/ceph/ceph-osd.24*
> -rw-r----- 1 ceph ceph    0 Sep 18 17:00 /var/log/ceph/ceph-osd.24.log
> -rw-r----- 1 ceph ceph 1.4G Sep 18 17:00 /var/log/ceph/ceph-osd.24.log-20160918
>
> host5:~ # tail /var/log/ceph/ceph-osd.24.log-20160918
> 2016-09-18 11:36:18.292275 7fab858dc700 10 bluefs get_usage bdev 2 free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
> 2016-09-18 11:36:18.292279 7fab858dc700 10 bluefs _flush 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
> 2016-09-18 11:36:18.292280 7fab858dc700 10 bluefs _flush 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
> 2016-09-18 11:36:18.292281 7fab858dc700 10 bluefs _fsync 0x7fac47a5dd00 file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000+200000])
> 2016-09-18 11:36:18.292286 7fab858dc700 10 bluefs _flush 0x7fac47a5dd00 0x1b10000~cee to file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000+200000])
> 2016-09-18 11:36:18.292289 7fab858dc700 10 bluefs _flush_range 0x7fac47a5dd00 pos 0x1b10000 0x1b10000~cee to file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000+200000])
> 2016-09-18 11:36:18.292292 7fab858dc700 20 bluefs _flush_range file now file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000+200000])
> 2016-09-18 11:36:18.292296 7fab858dc700 20 bluefs _flush_range in 1:0x10000000+2100000 x_off 0x10000
> 2016-09-18 11:36:18.292297 7fab858dc700 20 bluefs _flush_range caching tail of 0xcee and padding block with zeros
> 2016-09-18 17:00:01.276990 7fab738b8700 -1 received  signal: Hangup from  PID: 89063 task name: killall -q -1 ceph-mon ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
>
> Further one of the osd process has crashed with this in the log:
>
> 2016-09-18 13:30:11.274012 7fdf399b8700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7fdf399b8700 time 2016-09-18 13:30:11.270019
> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370: FAILED assert(r == 0)
>
>  ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fdf4f73811b]
>  2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
>  3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0xcbd) [0x7fdf4f575b6d]
>  4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
>  5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
>  6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
>  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fdf4f686699]
>  8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
>  9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
>  10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
>  11: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fdf4f51814b]
>  12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
>  13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
>  14: (()+0x80a4) [0x7fdf4b7a50a4]
>  15: (clone()+0x6d) [0x7fdf4a61e04d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> This time I have captured the log with debug bluefs = 20/20
>
> Is there a good place where I can upload the trail of the log for sharing?
>
> Thanks,
> Nitin
>
>
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Bluestore OSD support in ceph-disk
  2016-09-19  1:58                                     ` Varada Kari
@ 2016-09-19  3:25                                       ` Somnath Roy
  2016-09-19  5:24                                         ` Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2016-09-19  3:25 UTC (permalink / raw)
  To: Varada Kari, Kamble, Nitin A; +Cc: Sage Weil, Ceph Development

The crash Nitin is getting is different. I think it could be related to the aio limit of Linux or of the disk. Check the device's nr_requests and queue_depth settings. If it is related to Linux (syslog should have a message about it, if I recall correctly), increase fs.aio-max-nr.
There should be an error string printed in the log before the assert. Search for "aio submit got" in the ceph-osd.<num>.log.
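
Something like this should show the current limits and the error line (sdX and the osd number are placeholders):

       # kernel-wide aio limit and per-device queue settings
       cat /proc/sys/fs/aio-max-nr
       cat /sys/block/sdX/queue/nr_requests
       cat /sys/block/sdX/device/queue_depth
       # the aio error logged just before the assert
       grep "aio submit got" /var/log/ceph/ceph-osd.<num>.log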

Thanks & Regards
Somnath

-----Original Message-----
From: Varada Kari
Sent: Sunday, September 18, 2016 6:59 PM
To: Kamble, Nitin A
Cc: Somnath Roy; Sage Weil; Ceph Development
Subject: Re: Bluestore OSD support in ceph-disk

If you are not running with latest master, could you please retry with latest master. https://github.com/ceph/ceph/pull/11095 should solve the problem.

if you are hitting the problem with the latest master, please post the logs in shared location like google drive or pastebin etc...

Varada

On Monday 19 September 2016 05:58 AM, Kamble, Nitin A wrote:
> I find the ceph-osd processes which are taking 100% cpu are all have common log last line.
>
>
> It means the log rotation has triggered, and it takes forever to finish.
> host5:~ # ls -lh /var/log/ceph/ceph-osd.24*
> -rw-r----- 1 ceph ceph    0 Sep 18 17:00 /var/log/ceph/ceph-osd.24.log
> -rw-r----- 1 ceph ceph 1.4G Sep 18 17:00
> /var/log/ceph/ceph-osd.24.log-20160918
>
> host5:~ # tail /var/log/ceph/ceph-osd.24.log-20160918
> 2016-09-18 11:36:18.292275 7fab858dc700 10 bluefs get_usage bdev 2
> free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
> 2016-09-18 11:36:18.292279 7fab858dc700 10 bluefs _flush
> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
> 2016-09-18 11:36:18.292280 7fab858dc700 10 bluefs _flush
> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
> 2016-09-18 11:36:18.292281 7fab858dc700 10 bluefs _fsync
> 0x7fac47a5dd00 file(ino 24 size 0x3d7cdc5 mtime 2016-09-18
> 11:36:04.164949 bdev 0 extents
> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000
> +200000])
> 2016-09-18 11:36:18.292286 7fab858dc700 10 bluefs _flush
> 0x7fac47a5dd00 0x1b10000~cee to file(ino 24 size 0x3d7cdc5 mtime
> 2016-09-18 11:36:04.164949 bdev 0 extents
> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000
> +200000])
> 2016-09-18 11:36:18.292289 7fab858dc700 10 bluefs _flush_range
> 0x7fac47a5dd00 pos 0x1b10000 0x1b10000~cee to file(ino 24 size
> 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents
> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000
> +200000])
> 2016-09-18 11:36:18.292292 7fab858dc700 20 bluefs _flush_range file
> now file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0
> extents
> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000
> +200000])
> 2016-09-18 11:36:18.292296 7fab858dc700 20 bluefs _flush_range in
> 1:0x10000000+2100000 x_off 0x10000
> 2016-09-18 11:36:18.292297 7fab858dc700 20 bluefs _flush_range caching
> tail of 0xcee and padding block with zeros
> 2016-09-18 17:00:01.276990 7fab738b8700 -1 received  signal: Hangup
> from  PID: 89063 task name: killall -q -1 ceph-mon ceph-mds ceph-osd
> ceph-fuse radosgw  UID: 0
>
> Further one of the osd process has crashed with this in the log:
>
> 2016-09-18 13:30:11.274012 7fdf399b8700 -1
> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/B
> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In
> function 'virtual void KernelDevice::aio_submit(IOContext*)' thread
> 7fdf399b8700 time 2016-09-18 13:30:11.270019
> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/B
> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370:
> FAILED assert(r == 0)
>
>  ceph version v11.0.0-2309-g9096ad3
> (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0x7fdf4f73811b]
>  2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
>  3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
> long)+0xcbd) [0x7fdf4f575b6d]
>  4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
>  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
>  6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
>  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139)
> [0x7fdf4f686699]
>  8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
>  9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
> unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
>  10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
> rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
>  11:
> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::Tra
> nsactionImpl>)+0x5b) [0x7fdf4f51814b]
>  12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
>  13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
>  14: (()+0x80a4) [0x7fdf4b7a50a4]
>  15: (clone()+0x6d) [0x7fdf4a61e04d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> This time I have captured the log with debug bluefs = 20/20
>
> Is there a good place where I can upload the trail of the log for sharing?
>
> Thanks,
> Nitin
>
>
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-19  3:25                                       ` Somnath Roy
@ 2016-09-19  5:24                                         ` Kamble, Nitin A
  2016-09-19  5:32                                           ` Somnath Roy
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-19  5:24 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Varada Kari, Sage Weil, Ceph Development


> On Sep 18, 2016, at 8:25 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> The crash Nitin is getting is different. I think it could be related to aio limit of the Linux/disk. Check the device nr_requests and queue_depth settings. If it is related to Linux (syslog should be having that if I can recall), increase fs.aio-max-nr.
> There should be an error string printed in the log before assert. Search with " aio submit got" in the ceph-osd.<num>.log.
> 
> Thanks & Regards
> Somnath

Here is further information if it helps in understanding the assert.

host4:/ # cat /proc/sys/fs/aio-max-nr 
65536
host4:/ # cat /sys/block/sdk/device/queue_depth 
256
host4:/ # cat /sys/block/sdk/queue/nr_requests 
128
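
For reference, this is roughly what I would try next if the aio limit theory pans out; a sketch only, and the specific values and the ceph-osd.18.log file name are just my guesses:

# raise the system-wide aio limit (the value here is an arbitrary example)
sysctl -w fs.aio-max-nr=1048576
# bump the per-device request queue (may be capped by the driver)
echo 1024 > /sys/block/sdk/queue/nr_requests
# locate the error string mentioned above
grep "aio submit got" /var/log/ceph/ceph-osd.18.log

A matching fs.aio-max-nr entry in /etc/sysctl.conf would make the change persistent across reboots.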

As seen below, there is an "aio submit got" error just before the assert.
I did not find anything related in dmesg.

2016-09-18 13:30:11.011673 7fdf399b8700 20 bluefs _flush_range in 0:0xdd00000+400000 x_off 0x3ec257
2016-09-18 13:30:11.011674 7fdf399b8700 20 bluefs _flush_range using partial tail 0x257
2016-09-18 13:30:11.011676 7fdf399b8700 20 bluefs _flush_range waiting for previous aio to complete
2016-09-18 13:30:11.011711 7fdf399b8700 20 bluefs _flush_range h 0x7fe026503400 pos now 0x3c00000
2016-09-18 13:30:11.011732 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15358 < min_flush_size 65536
2016-09-18 13:30:11.011932 7fdf399b8700 10 bluefs get_usage bdev 0 free 41943040 (40960 kB) / 268431360 (255 MB), used 84%
2016-09-18 13:30:11.011935 7fdf399b8700 10 bluefs get_usage bdev 1 free 77204553728 (73628 MB) / 79456886784 (75775 MB), used 2%
2016-09-18 13:30:11.011938 7fdf399b8700 10 bluefs get_usage bdev 2 free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
2016-09-18 13:30:11.011940 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15381 < min_flush_size 65536
2016-09-18 13:30:11.011941 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15381 < min_flush_size 65536
2016-09-18 13:30:11.011942 7fdf399b8700 10 bluefs _fsync 0x7fe026503400 file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
2016-09-18 13:30:11.011946 7fdf399b8700 10 bluefs _flush 0x7fe026503400 0x3c00000~3c15 to file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
2016-09-18 13:30:11.011949 7fdf399b8700 10 bluefs _flush_range 0x7fe026503400 pos 0x3c00000 0x3c00000~3c15 to file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
2016-09-18 13:30:11.011953 7fdf399b8700 20 bluefs _flush_range file now file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
2016-09-18 13:30:11.011956 7fdf399b8700 20 bluefs _flush_range in 1:0xe100000+200000 x_off 0x0
2016-09-18 13:30:11.011958 7fdf399b8700 20 bluefs _flush_range caching tail of 0xc15 and padding block with zeros
2016-09-18 13:30:11.270003 7fdf399b8700 -1 bdev(/var/lib/ceph/osd/ceph-18/block.wal)  aio submit got (9) Bad file descriptor
2016-09-18 13:30:11.274012 7fdf399b8700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7fdf399b8700 time 2016-09-18 13:30:11.270019
/build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370: FAILED assert(r == 0)

 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fdf4f73811b]
 2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0xcbd) [0x7fdf4f575b6d]
 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
 5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
 6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fdf4f686699]
 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
 11: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fdf4f51814b]
 12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
 13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
 14: (()+0x80a4) [0x7fdf4b7a50a4]
 15: (clone()+0x6d) [0x7fdf4a61e04d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Thanks,
Nitin


> 
> -----Original Message-----
> From: Varada Kari
> Sent: Sunday, September 18, 2016 6:59 PM
> To: Kamble, Nitin A
> Cc: Somnath Roy; Sage Weil; Ceph Development
> Subject: Re: Bluestore OSD support in ceph-disk
> 
> If you are not running with latest master, could you please retry with latest master. https://github.com/ceph/ceph/pull/11095 should solve the problem.
> 
> if you are hitting the problem with the latest master, please post the logs in shared location like google drive or pastebin etc...
> 
> Varada
> 
> On Monday 19 September 2016 05:58 AM, Kamble, Nitin A wrote:
>> I find that the ceph-osd processes which are taking 100% CPU all have the same last log line.
>> 
>> 
>> It means that log rotation has been triggered, and it takes forever to finish.
>> host5:~ # ls -lh /var/log/ceph/ceph-osd.24*
>> -rw-r----- 1 ceph ceph    0 Sep 18 17:00 /var/log/ceph/ceph-osd.24.log
>> -rw-r----- 1 ceph ceph 1.4G Sep 18 17:00
>> /var/log/ceph/ceph-osd.24.log-20160918
>> 
>> host5:~ # tail /var/log/ceph/ceph-osd.24.log-20160918
>> 2016-09-18 11:36:18.292275 7fab858dc700 10 bluefs get_usage bdev 2
>> free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
>> 2016-09-18 11:36:18.292279 7fab858dc700 10 bluefs _flush
>> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
>> 2016-09-18 11:36:18.292280 7fab858dc700 10 bluefs _flush
>> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
>> 2016-09-18 11:36:18.292281 7fab858dc700 10 bluefs _fsync
>> 0x7fac47a5dd00 file(ino 24 size 0x3d7cdc5 mtime 2016-09-18
>> 11:36:04.164949 bdev 0 extents
>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000
>> +200000])
>> 2016-09-18 11:36:18.292286 7fab858dc700 10 bluefs _flush
>> 0x7fac47a5dd00 0x1b10000~cee to file(ino 24 size 0x3d7cdc5 mtime
>> 2016-09-18 11:36:04.164949 bdev 0 extents
>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000
>> +200000])
>> 2016-09-18 11:36:18.292289 7fab858dc700 10 bluefs _flush_range
>> 0x7fac47a5dd00 pos 0x1b10000 0x1b10000~cee to file(ino 24 size
>> 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents
>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000
>> +200000])
>> 2016-09-18 11:36:18.292292 7fab858dc700 20 bluefs _flush_range file
>> now file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0
>> extents
>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x100000
>> +200000])
>> 2016-09-18 11:36:18.292296 7fab858dc700 20 bluefs _flush_range in
>> 1:0x10000000+2100000 x_off 0x10000
>> 2016-09-18 11:36:18.292297 7fab858dc700 20 bluefs _flush_range caching
>> tail of 0xcee and padding block with zeros
>> 2016-09-18 17:00:01.276990 7fab738b8700 -1 received  signal: Hangup
>> from  PID: 89063 task name: killall -q -1 ceph-mon ceph-mds ceph-osd
>> ceph-fuse radosgw  UID: 0
>> 
>> Further one of the osd process has crashed with this in the log:
>> 
>> 2016-09-18 13:30:11.274012 7fdf399b8700 -1
>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/B
>> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In
>> function 'virtual void KernelDevice::aio_submit(IOContext*)' thread
>> 7fdf399b8700 time 2016-09-18 13:30:11.270019
>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/B
>> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370:
>> FAILED assert(r == 0)
>> 
>> ceph version v11.0.0-2309-g9096ad3
>> (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x8b) [0x7fdf4f73811b]
>> 2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
>> 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
>> long)+0xcbd) [0x7fdf4f575b6d]
>> 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
>> 5: (BlueFS::_fsync(BlueFS::FileWriter*,
>> std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
>> 6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
>> 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139)
>> [0x7fdf4f686699]
>> 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
>> 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&,
>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*,
>> unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
>> 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>> rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
>> 11:
>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::Tra
>> nsactionImpl>)+0x5b) [0x7fdf4f51814b]
>> 12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
>> 13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
>> 14: (()+0x80a4) [0x7fdf4b7a50a4]
>> 15: (clone()+0x6d) [0x7fdf4a61e04d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> 
>> This time I have captured the log with debug bluefs = 20/20
>> 
>> Is there a good place where I can upload the trail of the log for sharing?
>> 
>> Thanks,
>> Nitin
>> 
>> 
>> 
>> 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Bluestore OSD support in ceph-disk
  2016-09-19  5:24                                         ` Kamble, Nitin A
@ 2016-09-19  5:32                                           ` Somnath Roy
  2016-09-20  5:47                                             ` Kamble, Nitin A
  0 siblings, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2016-09-19  5:32 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Varada Kari, Sage Weil, Ceph Development

Not what I thought; it looks like some corruption. Please post a verbose log if possible.
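
In case it helps, something along these lines is what I have in mind for the verbose log; a sketch only -- debug bluefs = 20/20 is what you already used, while the debug bdev level, the [osd] section edit and the osd.18 target are my assumptions:

# at runtime, without restarting the daemon
ceph tell osd.18 injectargs '--debug-bluefs 20/20 --debug-bdev 20/20'
# or persistently via ceph.conf, followed by an OSD restart
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
        debug bluefs = 20/20
        debug bdev = 20/20
EOF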

-----Original Message-----
From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com] 
Sent: Sunday, September 18, 2016 10:24 PM
To: Somnath Roy
Cc: Varada Kari; Sage Weil; Ceph Development
Subject: Re: Bluestore OSD support in ceph-disk


> On Sep 18, 2016, at 8:25 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> The crash Nitin is getting is different. I think it could be related to aio limit of the Linux/disk. Check the device nr_requests and queue_depth settings. If it is related to Linux (syslog should be having that if I can recall), increase fs.aio-max-nr.
> There should be an error string printed in the log before assert. Search with " aio submit got" in the ceph-osd.<num>.log.
> 
> Thanks & Regards
> Somnath

Here is further information if it helps in understanding the assert.

host4:/ # cat /proc/sys/fs/aio-max-nr
65536
host4:/ # cat /sys/block/sdk/device/queue_depth
256
host4:/ # cat /sys/block/sdk/queue/nr_requests
128

As seen below, there is an "aio submit got" error just before the assert.
I did not find anything related in dmesg

2016-09-18 13:30:11.011673 7fdf399b8700 20 bluefs _flush_range in 0:0xdd00000+400000 x_off 0x3ec257
2016-09-18 13:30:11.011674 7fdf399b8700 20 bluefs _flush_range using partial tail 0x257
2016-09-18 13:30:11.011676 7fdf399b8700 20 bluefs _flush_range waiting for previous aio to complete
2016-09-18 13:30:11.011711 7fdf399b8700 20 bluefs _flush_range h 0x7fe026503400 pos now 0x3c00000
2016-09-18 13:30:11.011732 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15358 < min_flush_size 65536
2016-09-18 13:30:11.011932 7fdf399b8700 10 bluefs get_usage bdev 0 free 41943040 (40960 kB) / 268431360 (255 MB), used 84%
2016-09-18 13:30:11.011935 7fdf399b8700 10 bluefs get_usage bdev 1 free 77204553728 (73628 MB) / 79456886784 (75775 MB), used 2%
2016-09-18 13:30:11.011938 7fdf399b8700 10 bluefs get_usage bdev 2 free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
2016-09-18 13:30:11.011940 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15381 < min_flush_size 65536
2016-09-18 13:30:11.011941 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15381 < min_flush_size 65536
2016-09-18 13:30:11.011942 7fdf399b8700 10 bluefs _fsync 0x7fe026503400 file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
2016-09-18 13:30:11.011946 7fdf399b8700 10 bluefs _flush 0x7fe026503400 0x3c00000~3c15 to file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
2016-09-18 13:30:11.011949 7fdf399b8700 10 bluefs _flush_range 0x7fe026503400 pos 0x3c00000 0x3c00000~3c15 to file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
2016-09-18 13:30:11.011953 7fdf399b8700 20 bluefs _flush_range file now file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
2016-09-18 13:30:11.011956 7fdf399b8700 20 bluefs _flush_range in 1:0xe100000+200000 x_off 0x0
2016-09-18 13:30:11.011958 7fdf399b8700 20 bluefs _flush_range caching tail of 0xc15 and padding block with zeros
2016-09-18 13:30:11.270003 7fdf399b8700 -1 bdev(/var/lib/ceph/osd/ceph-18/block.wal)  aio submit got (9) Bad file descriptor
2016-09-18 13:30:11.274012 7fdf399b8700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7fdf399b8700 time 2016-09-18 13:30:11.270019
/build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370: FAILED assert(r == 0)

 ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fdf4f73811b]
 2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0xcbd) [0x7fdf4f575b6d]
 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
 5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
 6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fdf4f686699]
 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
 11: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fdf4f51814b]
 12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
 13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
 14: (()+0x80a4) [0x7fdf4b7a50a4]
 15: (clone()+0x6d) [0x7fdf4a61e04d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Thanks,
Nitin


> 
> -----Original Message-----
> From: Varada Kari
> Sent: Sunday, September 18, 2016 6:59 PM
> To: Kamble, Nitin A
> Cc: Somnath Roy; Sage Weil; Ceph Development
> Subject: Re: Bluestore OSD support in ceph-disk
> 
> If you are not running with latest master, could you please retry with latest master. https://github.com/ceph/ceph/pull/11095 should solve the problem.
> 
> if you are hitting the problem with the latest master, please post the logs in shared location like google drive or pastebin etc...
> 
> Varada
> 
> On Monday 19 September 2016 05:58 AM, Kamble, Nitin A wrote:
>> I find the ceph-osd processes which are taking 100% cpu are all have common log last line.
>> 
>> 
>> It means the log rotation has triggered, and it takes forever to finish.
>> host5:~ # ls -lh /var/log/ceph/ceph-osd.24*
>> -rw-r----- 1 ceph ceph    0 Sep 18 17:00 /var/log/ceph/ceph-osd.24.log
>> -rw-r----- 1 ceph ceph 1.4G Sep 18 17:00
>> /var/log/ceph/ceph-osd.24.log-20160918
>> 
>> host5:~ # tail /var/log/ceph/ceph-osd.24.log-20160918
>> 2016-09-18 11:36:18.292275 7fab858dc700 10 bluefs get_usage bdev 2 
>> free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
>> 2016-09-18 11:36:18.292279 7fab858dc700 10 bluefs _flush
>> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
>> 2016-09-18 11:36:18.292280 7fab858dc700 10 bluefs _flush
>> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
>> 2016-09-18 11:36:18.292281 7fab858dc700 10 bluefs _fsync
>> 0x7fac47a5dd00 file(ino 24 size 0x3d7cdc5 mtime 2016-09-18
>> 11:36:04.164949 bdev 0 extents
>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
>> 0
>> +200000])
>> 2016-09-18 11:36:18.292286 7fab858dc700 10 bluefs _flush
>> 0x7fac47a5dd00 0x1b10000~cee to file(ino 24 size 0x3d7cdc5 mtime
>> 2016-09-18 11:36:04.164949 bdev 0 extents
>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
>> 0
>> +200000])
>> 2016-09-18 11:36:18.292289 7fab858dc700 10 bluefs _flush_range
>> 0x7fac47a5dd00 pos 0x1b10000 0x1b10000~cee to file(ino 24 size
>> 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents
>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
>> 0
>> +200000])
>> 2016-09-18 11:36:18.292292 7fab858dc700 20 bluefs _flush_range file 
>> now file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 
>> 0 extents
>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
>> 0
>> +200000])
>> 2016-09-18 11:36:18.292296 7fab858dc700 20 bluefs _flush_range in
>> 1:0x10000000+2100000 x_off 0x10000
>> 2016-09-18 11:36:18.292297 7fab858dc700 20 bluefs _flush_range 
>> caching tail of 0xcee and padding block with zeros
>> 2016-09-18 17:00:01.276990 7fab738b8700 -1 received  signal: Hangup 
>> from  PID: 89063 task name: killall -q -1 ceph-mon ceph-mds ceph-osd 
>> ceph-fuse radosgw  UID: 0
>> 
>> Further one of the osd process has crashed with this in the log:
>> 
>> 2016-09-18 13:30:11.274012 7fdf399b8700 -1 
>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/
>> B
>> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In 
>> function 'virtual void KernelDevice::aio_submit(IOContext*)' thread
>> 7fdf399b8700 time 2016-09-18 13:30:11.270019 
>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/
>> B
>> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370:
>> FAILED assert(r == 0)
>> 
>> ceph version v11.0.0-2309-g9096ad3
>> (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x8b) [0x7fdf4f73811b]
>> 2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
>> 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
>> long)+0xcbd) [0x7fdf4f575b6d]
>> 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
>> 5: (BlueFS::_fsync(BlueFS::FileWriter*,
>> std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
>> 6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
>> 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139)
>> [0x7fdf4f686699]
>> 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
>> 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
>> unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
>> 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>> rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
>> 11:
>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::Tr
>> a
>> nsactionImpl>)+0x5b) [0x7fdf4f51814b]
>> 12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
>> 13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
>> 14: (()+0x80a4) [0x7fdf4b7a50a4]
>> 15: (clone()+0x6d) [0x7fdf4a61e04d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> 
>> This time I have captured the log with debug bluefs = 20/20
>> 
>> Is there a good place where I can upload the trail of the log for sharing?
>> 
>> Thanks,
>> Nitin
>> 
>> 
>> 
>> 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-19  5:32                                           ` Somnath Roy
@ 2016-09-20  5:47                                             ` Kamble, Nitin A
  2016-09-20 11:53                                               ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Kamble, Nitin A @ 2016-09-20  5:47 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Varada Kari, Sage Weil, Ceph Development

I could not respond earlier as I am at the SNIA Storage Developer Conference in
Santa Clara this week.
BTW, if any of you are at the conference, feel free to pull me aside.


Here is the tail of the log from the point of the assert.

https://app.box.com/s/lei6g8ddp2j1y29p6gvpjjut9okzzkpd

Thanks,
Nitin

> On Sep 18, 2016, at 10:32 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> Not what I thought, some corruption it seems..Please post a verbose log if possible..
> 
> -----Original Message-----
> From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com] 
> Sent: Sunday, September 18, 2016 10:24 PM
> To: Somnath Roy
> Cc: Varada Kari; Sage Weil; Ceph Development
> Subject: Re: Bluestore OSD support in ceph-disk
> 
> 
>> On Sep 18, 2016, at 8:25 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> 
>> The crash Nitin is getting is different. I think it could be related to aio limit of the Linux/disk. Check the device nr_requests and queue_depth settings. If it is related to Linux (syslog should be having that if I can recall), increase fs.aio-max-nr.
>> There should be an error string printed in the log before assert. Search with " aio submit got" in the ceph-osd.<num>.log.
>> 
>> Thanks & Regards
>> Somnath
> 
> Here is further information if it helps in understanding the assert.
> 
> host4:/ # cat /proc/sys/fs/aio-max-nr
> 65536
> host4:/ # cat /sys/block/sdk/device/queue_depth
> 256
> host4:/ # cat /sys/block/sdk/queue/nr_requests
> 128
> 
> As seen below, there is an "aio submit got" error just before the assert.
> I did not find anything related in dmesg
> 
> 2016-09-18 13:30:11.011673 7fdf399b8700 20 bluefs _flush_range in 0:0xdd00000+400000 x_off 0x3ec257
> 2016-09-18 13:30:11.011674 7fdf399b8700 20 bluefs _flush_range using partial tail 0x257
> 2016-09-18 13:30:11.011676 7fdf399b8700 20 bluefs _flush_range waiting for previous aio to complete
> 2016-09-18 13:30:11.011711 7fdf399b8700 20 bluefs _flush_range h 0x7fe026503400 pos now 0x3c00000
> 2016-09-18 13:30:11.011732 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15358 < min_flush_size 65536
> 2016-09-18 13:30:11.011932 7fdf399b8700 10 bluefs get_usage bdev 0 free 41943040 (40960 kB) / 268431360 (255 MB), used 84%
> 2016-09-18 13:30:11.011935 7fdf399b8700 10 bluefs get_usage bdev 1 free 77204553728 (73628 MB) / 79456886784 (75775 MB), used 2%
> 2016-09-18 13:30:11.011938 7fdf399b8700 10 bluefs get_usage bdev 2 free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
> 2016-09-18 13:30:11.011940 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15381 < min_flush_size 65536
> 2016-09-18 13:30:11.011941 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15381 < min_flush_size 65536
> 2016-09-18 13:30:11.011942 7fdf399b8700 10 bluefs _fsync 0x7fe026503400 file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
> 2016-09-18 13:30:11.011946 7fdf399b8700 10 bluefs _flush 0x7fe026503400 0x3c00000~3c15 to file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
> 2016-09-18 13:30:11.011949 7fdf399b8700 10 bluefs _flush_range 0x7fe026503400 pos 0x3c00000 0x3c00000~3c15 to file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
> 2016-09-18 13:30:11.011953 7fdf399b8700 20 bluefs _flush_range file now file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
> 2016-09-18 13:30:11.011956 7fdf399b8700 20 bluefs _flush_range in 1:0xe100000+200000 x_off 0x0
> 2016-09-18 13:30:11.011958 7fdf399b8700 20 bluefs _flush_range caching tail of 0xc15 and padding block with zeros
> 2016-09-18 13:30:11.270003 7fdf399b8700 -1 bdev(/var/lib/ceph/osd/ceph-18/block.wal)  aio submit got (9) Bad file descriptor
> 2016-09-18 13:30:11.274012 7fdf399b8700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7fdf399b8700 time 2016-09-18 13:30:11.270019
> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370: FAILED assert(r == 0)
> 
> ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fdf4f73811b]
> 2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
> 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0xcbd) [0x7fdf4f575b6d]
> 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
> 5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
> 6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
> 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fdf4f686699]
> 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
> 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
> 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
> 11: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fdf4f51814b]
> 12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
> 13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
> 14: (()+0x80a4) [0x7fdf4b7a50a4]
> 15: (clone()+0x6d) [0x7fdf4a61e04d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> Thanks,
> Nitin
> 
> 
>> 
>> -----Original Message-----
>> From: Varada Kari
>> Sent: Sunday, September 18, 2016 6:59 PM
>> To: Kamble, Nitin A
>> Cc: Somnath Roy; Sage Weil; Ceph Development
>> Subject: Re: Bluestore OSD support in ceph-disk
>> 
>> If you are not running with latest master, could you please retry with latest master. https://github.com/ceph/ceph/pull/11095 should solve the problem.
>> 
>> if you are hitting the problem with the latest master, please post the logs in shared location like google drive or pastebin etc...
>> 
>> Varada
>> 
>> On Monday 19 September 2016 05:58 AM, Kamble, Nitin A wrote:
>>> I find the ceph-osd processes which are taking 100% cpu are all have common log last line.
>>> 
>>> 
>>> It means the log rotation has triggered, and it takes forever to finish.
>>> host5:~ # ls -lh /var/log/ceph/ceph-osd.24*
>>> -rw-r----- 1 ceph ceph    0 Sep 18 17:00 /var/log/ceph/ceph-osd.24.log
>>> -rw-r----- 1 ceph ceph 1.4G Sep 18 17:00
>>> /var/log/ceph/ceph-osd.24.log-20160918
>>> 
>>> host5:~ # tail /var/log/ceph/ceph-osd.24.log-20160918
>>> 2016-09-18 11:36:18.292275 7fab858dc700 10 bluefs get_usage bdev 2 
>>> free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
>>> 2016-09-18 11:36:18.292279 7fab858dc700 10 bluefs _flush
>>> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
>>> 2016-09-18 11:36:18.292280 7fab858dc700 10 bluefs _flush
>>> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
>>> 2016-09-18 11:36:18.292281 7fab858dc700 10 bluefs _fsync
>>> 0x7fac47a5dd00 file(ino 24 size 0x3d7cdc5 mtime 2016-09-18
>>> 11:36:04.164949 bdev 0 extents
>>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
>>> 0
>>> +200000])
>>> 2016-09-18 11:36:18.292286 7fab858dc700 10 bluefs _flush
>>> 0x7fac47a5dd00 0x1b10000~cee to file(ino 24 size 0x3d7cdc5 mtime
>>> 2016-09-18 11:36:04.164949 bdev 0 extents
>>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
>>> 0
>>> +200000])
>>> 2016-09-18 11:36:18.292289 7fab858dc700 10 bluefs _flush_range
>>> 0x7fac47a5dd00 pos 0x1b10000 0x1b10000~cee to file(ino 24 size
>>> 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents
>>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
>>> 0
>>> +200000])
>>> 2016-09-18 11:36:18.292292 7fab858dc700 20 bluefs _flush_range file 
>>> now file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 
>>> 0 extents
>>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
>>> 0
>>> +200000])
>>> 2016-09-18 11:36:18.292296 7fab858dc700 20 bluefs _flush_range in
>>> 1:0x10000000+2100000 x_off 0x10000
>>> 2016-09-18 11:36:18.292297 7fab858dc700 20 bluefs _flush_range 
>>> caching tail of 0xcee and padding block with zeros
>>> 2016-09-18 17:00:01.276990 7fab738b8700 -1 received  signal: Hangup 
>>> from  PID: 89063 task name: killall -q -1 ceph-mon ceph-mds ceph-osd 
>>> ceph-fuse radosgw  UID: 0
>>> 
>>> Further one of the osd process has crashed with this in the log:
>>> 
>>> 2016-09-18 13:30:11.274012 7fdf399b8700 -1 
>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/
>>> B
>>> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In 
>>> function 'virtual void KernelDevice::aio_submit(IOContext*)' thread
>>> 7fdf399b8700 time 2016-09-18 13:30:11.270019 
>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/
>>> B
>>> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370:
>>> FAILED assert(r == 0)
>>> 
>>> ceph version v11.0.0-2309-g9096ad3
>>> (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x8b) [0x7fdf4f73811b]
>>> 2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
>>> 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
>>> long)+0xcbd) [0x7fdf4f575b6d]
>>> 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
>>> 5: (BlueFS::_fsync(BlueFS::FileWriter*,
>>> std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
>>> 6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
>>> 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139)
>>> [0x7fdf4f686699]
>>> 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
>>> 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
>>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
>>> unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
>>> 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
>>> rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
>>> 11:
>>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::Tr
>>> a
>>> nsactionImpl>)+0x5b) [0x7fdf4f51814b]
>>> 12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
>>> 13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
>>> 14: (()+0x80a4) [0x7fdf4b7a50a4]
>>> 15: (clone()+0x6d) [0x7fdf4a61e04d]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>> 
>>> This time I have captured the log with debug bluefs = 20/20
>>> 
>>> Is there a good place where I can upload the trail of the log for sharing?
>>> 
>>> Thanks,
>>> Nitin
>>> 
>>> 
>>> 
>>> 
>> 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Bluestore OSD support in ceph-disk
  2016-09-20  5:47                                             ` Kamble, Nitin A
@ 2016-09-20 11:53                                               ` Sage Weil
  0 siblings, 0 replies; 28+ messages in thread
From: Sage Weil @ 2016-09-20 11:53 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Somnath Roy, Varada Kari, Ceph Development

On Tue, 20 Sep 2016, Kamble, Nitin A wrote:
> I could not respond earlier as I am in the SNIA storage developer conference in
> Santa Clara this week. 
> BTW if any of you are in this conference feel free to pull me aside.
> 
> 
> Here is the trail of the log from the point of assert. 
> 
> https://app.box.com/s/lei6g8ddp2j1y29p6gvpjjut9okzzkpd

These are usually painful to track down. EBADF means we passed a bad file 
descriptor, which usually means that some other thread accidentally closed 
our fd.  Is this reproducible?  If so, an strace -f log is usually enough 
to find the problem.
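
Something like the following is usually enough to catch it; the syscall filter, the pid lookup and the output path are just one way to do it, adjust as needed:

# attach to the affected ceph-osd and log every close() and io_submit()
pid=$(pidof -s ceph-osd)        # or pick the pid of the crashing OSD by hand
strace -f -tt -e trace=close,io_submit -p "$pid" -o /tmp/ceph-osd.strace
# when the assert fires again, look for a close() on the block.wal fd shortly
# before the failing io_submit(); /proc/<pid>/fd shows which fd that is, e.g.
ls -l /proc/"$pid"/fd | grep block.wal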

Thanks!
sage

> 
> Thanks,
> Nitin
> 
> > On Sep 18, 2016, at 10:32 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> > 
> > Not what I thought, some corruption it seems..Please post a verbose log if possible..
> > 
> > -----Original Message-----
> > From: Kamble, Nitin A [mailto:Nitin.Kamble@Teradata.com] 
> > Sent: Sunday, September 18, 2016 10:24 PM
> > To: Somnath Roy
> > Cc: Varada Kari; Sage Weil; Ceph Development
> > Subject: Re: Bluestore OSD support in ceph-disk
> > 
> > 
> >> On Sep 18, 2016, at 8:25 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> >> 
> >> The crash Nitin is getting is different. I think it could be related to aio limit of the Linux/disk. Check the device nr_requests and queue_depth settings. If it is related to Linux (syslog should be having that if I can recall), increase fs.aio-max-nr.
> >> There should be an error string printed in the log before assert. Search with " aio submit got" in the ceph-osd.<num>.log.
> >> 
> >> Thanks & Regards
> >> Somnath
> > 
> > Here is further information if it helps in understanding the assert.
> > 
> > host4:/ # cat /proc/sys/fs/aio-max-nr
> > 65536
> > host4:/ # cat /sys/block/sdk/device/queue_depth
> > 256
> > host4:/ # cat /sys/block/sdk/queue/nr_requests
> > 128
> > 
> > As seen below, there is an "aio submit got" error just before the assert.
> > I did not find anything related in dmesg
> > 
> > 2016-09-18 13:30:11.011673 7fdf399b8700 20 bluefs _flush_range in 0:0xdd00000+400000 x_off 0x3ec257
> > 2016-09-18 13:30:11.011674 7fdf399b8700 20 bluefs _flush_range using partial tail 0x257
> > 2016-09-18 13:30:11.011676 7fdf399b8700 20 bluefs _flush_range waiting for previous aio to complete
> > 2016-09-18 13:30:11.011711 7fdf399b8700 20 bluefs _flush_range h 0x7fe026503400 pos now 0x3c00000
> > 2016-09-18 13:30:11.011732 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15358 < min_flush_size 65536
> > 2016-09-18 13:30:11.011932 7fdf399b8700 10 bluefs get_usage bdev 0 free 41943040 (40960 kB) / 268431360 (255 MB), used 84%
> > 2016-09-18 13:30:11.011935 7fdf399b8700 10 bluefs get_usage bdev 1 free 77204553728 (73628 MB) / 79456886784 (75775 MB), used 2%
> > 2016-09-18 13:30:11.011938 7fdf399b8700 10 bluefs get_usage bdev 2 free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
> > 2016-09-18 13:30:11.011940 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15381 < min_flush_size 65536
> > 2016-09-18 13:30:11.011941 7fdf399b8700 10 bluefs _flush 0x7fe026503400 ignoring, length 15381 < min_flush_size 65536
> > 2016-09-18 13:30:11.011942 7fdf399b8700 10 bluefs _fsync 0x7fe026503400 file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
> > 2016-09-18 13:30:11.011946 7fdf399b8700 10 bluefs _flush 0x7fe026503400 0x3c00000~3c15 to file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
> > 2016-09-18 13:30:11.011949 7fdf399b8700 10 bluefs _flush_range 0x7fe026503400 pos 0x3c00000 0x3c00000~3c15 to file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
> > 2016-09-18 13:30:11.011953 7fdf399b8700 20 bluefs _flush_range file now file(ino 23 size 0x3d2a7de mtime 2016-09-18 13:28:24.844259 bdev 0 extents [0:0x9500000+100000,0:0x9a00000+1200000,0:0xb000000+1300000,0:0xc700000+1200000,0:0xdd00000+400000,1:0xe100000+200000])
> > 2016-09-18 13:30:11.011956 7fdf399b8700 20 bluefs _flush_range in 1:0xe100000+200000 x_off 0x0
> > 2016-09-18 13:30:11.011958 7fdf399b8700 20 bluefs _flush_range caching tail of 0xc15 and padding block with zeros
> > 2016-09-18 13:30:11.270003 7fdf399b8700 -1 bdev(/var/lib/ceph/osd/ceph-18/block.wal)  aio submit got (9) Bad file descriptor
> > 2016-09-18 13:30:11.274012 7fdf399b8700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7fdf399b8700 time 2016-09-18 13:30:11.270019
> > /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370: FAILED assert(r == 0)
> > 
> > ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fdf4f73811b]
> > 2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
> > 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0xcbd) [0x7fdf4f575b6d]
> > 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
> > 5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
> > 6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
> > 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fdf4f686699]
> > 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
> > 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
> > 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
> > 11: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fdf4f51814b]
> > 12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
> > 13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
> > 14: (()+0x80a4) [0x7fdf4b7a50a4]
> > 15: (clone()+0x6d) [0x7fdf4a61e04d]
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > 
> > Thanks,
> > Nitin
> > 
> > 
> >> 
> >> -----Original Message-----
> >> From: Varada Kari
> >> Sent: Sunday, September 18, 2016 6:59 PM
> >> To: Kamble, Nitin A
> >> Cc: Somnath Roy; Sage Weil; Ceph Development
> >> Subject: Re: Bluestore OSD support in ceph-disk
> >> 
> >> If you are not running with latest master, could you please retry with latest master. https://github.com/ceph/ceph/pull/11095 should solve the problem.
> >> 
> >> if you are hitting the problem with the latest master, please post the logs in shared location like google drive or pastebin etc...
> >> 
> >> Varada
> >> 
> >> On Monday 19 September 2016 05:58 AM, Kamble, Nitin A wrote:
> >>> I find the ceph-osd processes which are taking 100% cpu are all have common log last line.
> >>> 
> >>> 
> >>> It means the log rotation has triggered, and it takes forever to finish.
> >>> host5:~ # ls -lh /var/log/ceph/ceph-osd.24*
> >>> -rw-r----- 1 ceph ceph    0 Sep 18 17:00 /var/log/ceph/ceph-osd.24.log
> >>> -rw-r----- 1 ceph ceph 1.4G Sep 18 17:00
> >>> /var/log/ceph/ceph-osd.24.log-20160918
> >>> 
> >>> host5:~ # tail /var/log/ceph/ceph-osd.24.log-20160918
> >>> 2016-09-18 11:36:18.292275 7fab858dc700 10 bluefs get_usage bdev 2 
> >>> free 160031571968 (149 GB) / 160032612352 (149 GB), used 0%
> >>> 2016-09-18 11:36:18.292279 7fab858dc700 10 bluefs _flush
> >>> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
> >>> 2016-09-18 11:36:18.292280 7fab858dc700 10 bluefs _flush
> >>> 0x7fac47a5dd00 ignoring, length 3310 < min_flush_size 65536
> >>> 2016-09-18 11:36:18.292281 7fab858dc700 10 bluefs _fsync
> >>> 0x7fac47a5dd00 file(ino 24 size 0x3d7cdc5 mtime 2016-09-18
> >>> 11:36:04.164949 bdev 0 extents
> >>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
> >>> 0
> >>> +200000])
> >>> 2016-09-18 11:36:18.292286 7fab858dc700 10 bluefs _flush
> >>> 0x7fac47a5dd00 0x1b10000~cee to file(ino 24 size 0x3d7cdc5 mtime
> >>> 2016-09-18 11:36:04.164949 bdev 0 extents
> >>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
> >>> 0
> >>> +200000])
> >>> 2016-09-18 11:36:18.292289 7fab858dc700 10 bluefs _flush_range
> >>> 0x7fac47a5dd00 pos 0x1b10000 0x1b10000~cee to file(ino 24 size
> >>> 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 0 extents
> >>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
> >>> 0
> >>> +200000])
> >>> 2016-09-18 11:36:18.292292 7fab858dc700 20 bluefs _flush_range file 
> >>> now file(ino 24 size 0x3d7cdc5 mtime 2016-09-18 11:36:04.164949 bdev 
> >>> 0 extents
> >>> [0:0xe100000+d00000,0:0xf200000+e00000,1:0x10000000+2100000,0:0x10000
> >>> 0
> >>> +200000])
> >>> 2016-09-18 11:36:18.292296 7fab858dc700 20 bluefs _flush_range in
> >>> 1:0x10000000+2100000 x_off 0x10000
> >>> 2016-09-18 11:36:18.292297 7fab858dc700 20 bluefs _flush_range 
> >>> caching tail of 0xcee and padding block with zeros
> >>> 2016-09-18 17:00:01.276990 7fab738b8700 -1 received  signal: Hangup 
> >>> from  PID: 89063 task name: killall -q -1 ceph-mon ceph-mds ceph-osd 
> >>> ceph-fuse radosgw  UID: 0
> >>> 
> >>> Further one of the osd process has crashed with this in the log:
> >>> 
> >>> 2016-09-18 13:30:11.274012 7fdf399b8700 -1 
> >>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/
> >>> B
> >>> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: In 
> >>> function 'virtual void KernelDevice::aio_submit(IOContext*)' thread
> >>> 7fdf399b8700 time 2016-09-18 13:30:11.270019 
> >>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/
> >>> B
> >>> UILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/KernelDevice.cc: 370:
> >>> FAILED assert(r == 0)
> >>> 
> >>> ceph version v11.0.0-2309-g9096ad3
> >>> (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
> >>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >>> const*)+0x8b) [0x7fdf4f73811b]
> >>> 2: (KernelDevice::aio_submit(IOContext*)+0x76d) [0x7fdf4f597dbd]
> >>> 3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
> >>> long)+0xcbd) [0x7fdf4f575b6d]
> >>> 4: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xe9) [0x7fdf4f576c79]
> >>> 5: (BlueFS::_fsync(BlueFS::FileWriter*,
> >>> std::unique_lock<std::mutex>&)+0x6d) [0x7fdf4f579a6d]
> >>> 6: (BlueRocksWritableFile::Sync()+0x4e) [0x7fdf4f58f25e]
> >>> 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139)
> >>> [0x7fdf4f686699]
> >>> 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fdf4f687238]
> >>> 9: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> >>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
> >>> unsigned long, bool)+0x13cf) [0x7fdf4f5dea2f]
> >>> 10: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&,
> >>> rocksdb::WriteBatch*)+0x27) [0x7fdf4f5df637]
> >>> 11:
> >>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::Tr
> >>> a
> >>> nsactionImpl>)+0x5b) [0x7fdf4f51814b]
> >>> 12: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fdf4f4e5ffa]
> >>> 13: (BlueStore::KVSyncThread::entry()+0xd) [0x7fdf4f4f3a6d]
> >>> 14: (()+0x80a4) [0x7fdf4b7a50a4]
> >>> 15: (clone()+0x6d) [0x7fdf4a61e04d]
> >>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>> 
> >>> This time I have captured the log with debug bluefs = 20/20
> >>> 
> >>> Is there a good place where I can upload the trail of the log for sharing?
> >>> 
> >>> Thanks,
> >>> Nitin
> >>> 
> >>> 
> >>> 
> >>> 
> >> 
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-09-20 11:54 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-14 19:10 Best latest commit to run bluestore Kamble, Nitin A
2016-09-14 19:13 ` Somnath Roy
2016-09-14 19:47   ` Kamble, Nitin A
2016-09-15 18:23     ` Bluestore OSD support in ceph-disk Kamble, Nitin A
2016-09-15 18:34       ` Sage Weil
2016-09-15 18:46         ` Kamble, Nitin A
2016-09-15 18:54           ` Sage Weil
2016-09-16  6:43             ` Kamble, Nitin A
2016-09-16 18:38               ` Kamble, Nitin A
2016-09-16 18:43                 ` Somnath Roy
2016-09-16 19:00                   ` Kamble, Nitin A
2016-09-16 19:23                     ` Somnath Roy
2016-09-16 20:25                       ` Kamble, Nitin A
2016-09-16 20:36                         ` Somnath Roy
2016-09-16 20:54                         ` Sage Weil
2016-09-16 23:09                           ` Kamble, Nitin A
2016-09-16 23:15                             ` Somnath Roy
2016-09-18  6:41                               ` Kamble, Nitin A
2016-09-18  7:06                                 ` Varada Kari
2016-09-19  0:28                                   ` Kamble, Nitin A
2016-09-19  1:58                                     ` Varada Kari
2016-09-19  3:25                                       ` Somnath Roy
2016-09-19  5:24                                         ` Kamble, Nitin A
2016-09-19  5:32                                           ` Somnath Roy
2016-09-20  5:47                                             ` Kamble, Nitin A
2016-09-20 11:53                                               ` Sage Weil
2016-09-17 14:14                             ` Sage Weil
2016-09-18  6:41                               ` Kamble, Nitin A
