* latency compare between 2t NVME SSD P3500 and bluestore
@ 2017-07-12  9:15 攀刘
  2017-07-12 12:02 ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: 攀刘 @ 2017-07-12  9:15 UTC (permalink / raw)
  To: Ceph Development

Hi Cephers,

I did some experiments today to compare the latency of a single
P3500 (2TB NVMe SSD) against bluestore (fio + libfio_objectstore.so):

For iodepth = 1, the random write latency of bluestore is 276.91 us,
compared with 14.71 us for the raw SSD -- a big overhead.

I also tested iodepth = 16; still, there is a big overhead (143 us -> 642 us).
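
For reference, the fio job file looked roughly like this (option names
written from memory, so please treat it as a sketch rather than the
exact file):

    [global]
    ; conf points at a ceph.conf with osd objectstore = bluestore
    ioengine=external:./libfio_objectstore.so
    conf=./ceph-bluestore.conf
    rw=randwrite
    bs=4k
    time_based=1
    runtime=60

    [bluestore-qd1]
    iodepth=1

The raw-SSD baseline used the stock libaio engine directly against the
device:

    [nvme-qd1]
    ioengine=libaio
    filename=/dev/nvme0n1
    direct=1
    rw=randwrite
    bs=4k
    iodepth=1
    time_based=1
    runtime=60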

What is your opinion?

Thanks
Pan


* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-12  9:15 latency compare between 2t NVME SSD P3500 and bluestore 攀刘
@ 2017-07-12 12:02 ` Sage Weil
  2017-07-12 12:45   ` 攀刘
  0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2017-07-12 12:02 UTC (permalink / raw)
  To: 攀刘; +Cc: Ceph Development


On Wed, 12 Jul 2017, 攀刘 wrote:
> Hi Cephers,
> 
> I did some experiments today to compare the latency of a single
> P3500 (2TB NVMe SSD) against bluestore (fio + libfio_objectstore.so):
> 
> For iodepth = 1, the random write latency of bluestore is 276.91 us,
> compared with 14.71 us for the raw SSD -- a big overhead.
> 
> I also tested iodepth = 16; still, there is a big overhead (143 us -> 642 us).
> 
> What is your opinion?

There is a lot of work that bluestore is doing over the raw device as it 
is implementing all of the metadata tracking, checksumming, allocation, 
and so on.  There's definitely lots of room for improvement, but I'm 
not sure you can expect to see latencies in the 10s of us.  That said, it 
would be interesting to see an updated flamegraph to see where the time is 
being spent and where we can slim this down.  On a new nvme it's possible 
we can do away with some of the complexity of, say, the allocator, since 
the FTL is performing a lot of the same work anyway.

sage


* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-12 12:02 ` Sage Weil
@ 2017-07-12 12:45   ` 攀刘
  2017-07-12 13:55     ` Sage Weil
  0 siblings, 1 reply; 14+ messages in thread
From: 攀刘 @ 2017-07-12 12:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development, p.zhou

Hi Sage,

Yes, I totally understand that bluestore does much more than a raw
disk, but the current overhead is a little too big for our usage. I
will compare bluestore with XFS (which also does metadata tracking,
allocation, and so on) to see whether XFS has a similar impact.

I will provide a flamegraph later, but from the perf counters we
found that most of the time was spent in "kv_lat".
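
(On a running OSD, the same counters can be pulled live from the admin
socket with something like:

    ceph daemon osd.0 perf dump | grep -A 3 kv_lat

with the fio objectstore plugin, I read them from the dumped log
instead.)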

As for the FTL, yes, it is a good idea; after we get the flamegraph,
we can discuss which parts could be improved by the FTL, firmware, or
even Open-Channel SSDs.
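
For the flamegraph, I will use the usual perf + FlameGraph recipe,
roughly:

    perf record -F 99 -g -p $(pidof fio) -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > bluestore.svg

(stackcollapse-perf.pl and flamegraph.pl are from
https://github.com/brendangregg/FlameGraph.)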






* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-12 12:45   ` 攀刘
@ 2017-07-12 13:55     ` Sage Weil
  2017-07-12 16:25       ` 攀刘
  0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2017-07-12 13:55 UTC (permalink / raw)
  To: 攀刘; +Cc: Ceph Development, p.zhou


On Wed, 12 Jul 2017, 攀刘 wrote:
> Hi Sage,
> 
> Yes, I totally understand that bluestore does much more than a raw
> disk, but the current overhead is a little too big for our usage. I
> will compare bluestore with XFS (which also does metadata tracking,
> allocation, and so on) to see whether XFS has a similar impact.
> 
> I will provide a flamegraph later, but from the perf counters we
> found that most of the time was spent in "kv_lat".

That's rocksdb.  And yeah, I think it's pretty clear that either rocksdb 
needs some serious work to really keep up with NVMe (or Optane) or (more 
likely) we need an alternate kv backend that is targeting high-speed 
flash.  I suspect the latter makes the most sense, and I believe there are 
various efforts at Intel looking at alternatives, but no winner just yet.

Looking a bit further out, I think a new kv library that natively targets 
persistent memory (e.g., something built on pmem.io) will be the right 
solution.  Although at that point, it's probably a question of whether we 
have pmem for metadata and 3D NAND for data, or pure pmem; in the latter 
case a complete replacement for bluestore would make more sense.

> As for the FTL, yes, it is a good idea; after we get the flamegraph,
> we can discuss which parts could be improved by the FTL, firmware, or
> even Open-Channel SSDs.

Yep!
sage





* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-12 13:55     ` Sage Weil
@ 2017-07-12 16:25       ` 攀刘
  2017-07-12 16:34         ` Xiaoxi Chen
  0 siblings, 1 reply; 14+ messages in thread
From: 攀刘 @ 2017-07-12 16:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development, p.zhou, 20702390

Hi Sage,

Indeed, I have an idea that I have been holding for a long time.

Do we really need a heavy k/v database to store metadata, especially
for fast disks? Introducing a third-party database also makes
maintenance more difficult (maybe because of my limited database
knowledge)...

Let's suppose:
1) The max number of PGs in one OSD is limited (in my experience,
100~200 PGs per OSD gives the best performance).
2) The max number of objects in one PG is limited, because of disk space.

Then, how about this: pre-allocate the metadata locations in a
metadata partition.

Partition the SSD into two or three partitions (same as bluestore),
but instead of using a kv database, store the metadata directly in one
disk partition (call it the metadata partition). Inside this metadata
partition, we store several data structures (a rough sketch in code
follows below):
1) One hash table of PGs: the key is the PG id, and the value is
another hash table (whose key is the object index within this PG, and
whose value is the object metadata plus the object's location in the
data partition).
2) A free-object-location list.

And other extra things...
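
To make the layout concrete, a minimal C++-style sketch (field names
and sizes are hypothetical, just to illustrate the idea):

    // Fixed-size record, pre-allocated in the metadata partition.
    struct ObjectRecord {
      uint64_t object_hash;    // identifies the object within its PG
      uint64_t data_offset;    // object location in the data partition
      uint32_t data_length;
      uint32_t flags;          // in-use / free / tombstone
      char     inline_md[232]; // small fixed budget for object metadata
    };                         // 256 bytes per slot

    // Per-PG bucket: a fixed-capacity hash table of ObjectRecords, so a
    // record's disk address is computable from (pg_id, object_hash)
    // with no tree lookup and no allocator call.
    struct PGTable {
      uint64_t pg_id;
      uint32_t num_slots;      // bounded, since objects per PG are bounded
      // on disk: ObjectRecord slots[num_slots] follows this header
    };

Because both the PG count and the per-PG object count are bounded,
every slot's location is fixed in advance, and a metadata update is a
single in-place write of one 256-byte record.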

The max number of PGs belonging to one OSD can be limited by options,
so I believe the metadata partition need not be big. We could load all
the metadata into RAM if RAM is big enough, keep part of it in an
LRU-controlled cache, or just read, modify, and write back to disk
when needed.

Do you think this idea is reasonable? At least, I believe this kind of
new storage engine would be much faster.

Thanks
Pan


* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-12 16:25       ` 攀刘
@ 2017-07-12 16:34         ` Xiaoxi Chen
  2017-07-14  1:47           ` xiaoyan li
  0 siblings, 1 reply; 14+ messages in thread
From: Xiaoxi Chen @ 2017-07-12 16:34 UTC (permalink / raw)
  To: 攀刘; +Cc: Sage Weil, Ceph Development, p.zhou, 20702390

FWIW, one thing that a KV DB can provide is transaction support, which
is important since we need to update several pieces of metadata (onode,
allocator map, and the WAL for small writes) transactionally.
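
To make that concrete, a minimal rocksdb sketch of the kind of atomic
multi-key update bluestore needs (illustrative only; the key names are
made up and are not bluestore's actual schema):

    #include <cassert>
    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>

    // Either all three updates become durable together, or none do.
    void commit_txc(rocksdb::DB* db) {
      rocksdb::WriteBatch batch;
      batch.Put("O.pg1.2a.obj17", "serialized onode");
      batch.Put("B.alloc.0x7f000", "allocator bitmap delta");
      batch.Put("L.wal.seq42", "deferred small-write data");

      rocksdb::WriteOptions opts;
      opts.sync = true;  // durable once Write() returns
      rocksdb::Status s = db->Write(opts, &batch);
      assert(s.ok());
    }

Getting the same all-or-nothing guarantee on a raw partition would
require a hand-rolled journal.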




* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-12 16:34         ` Xiaoxi Chen
@ 2017-07-14  1:47           ` xiaoyan li
  2017-07-14  1:54             ` Mark Nelson
  2017-07-14  2:49             ` Ma, Jianpeng
  0 siblings, 2 replies; 14+ messages in thread
From: xiaoyan li @ 2017-07-14  1:47 UTC (permalink / raw)
  To: Xiaoxi Chen
  Cc: 攀刘, Sage Weil, Ceph Development, p.zhou, 20702390

Hi,
I am concerned about the rocksdb impact on bluestore's whole IO path.
I did some tests with the bluestore fio plugin.
For example, I got the following data from the log when I ran a
bluestore fio test with numjobs=64 and iodepth=32. It seems that for
every txc, most of the time is spent in the queued and committing states.
state                    time span (us)
state_prepare_lat           386
state_aio_wait_lat          430
state_io_done_lat             0
state_kv_queued_lat        7926
state_kv_commiting_lat    30653
state_kv_done_lat             4

        "state_kv_queued_lat": {
            "avgcount": 349076566,
            "sum": 1214245.959889817,
            "avgtime": 0.003478451
        },
        "state_kv_commiting_lat": {
            "avgcount": 174538283,
            "sum": 5612849.022306266,
            "avgtime": 0.032158268
        },


At the same time, submitting a batch of (174538283 / 3509556 ≈ 49) txcs
takes only 1024 us on average, which is much less than the commiting_lat
of 30653 us.
        "kv_lat": {
            "avgcount": 3509556,
            "sum": 3594.365142193,
            "avgtime": 0.001024165
        },
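
(Reading the counters: avgtime = sum / avgcount, in seconds. So kv_lat
is 3594.365 / 3509556 ≈ 0.001024 s ≈ 1024 us per _kv_sync_thread call,
while state_kv_commiting_lat is 5612849.022 / 174538283 ≈ 0.032158 s ≈
32 ms per txc.)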

The time between state_kv_queued_lat and state_kv_commiting_lat:
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8349
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8366
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L7741

I am still investigating why so much time is spent in
kv_commiting_lat, but from the above data I doubt the problem is in
rocksdb itself.
Please correct me if I misunderstood anything.

Lisa



* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-14  1:47           ` xiaoyan li
@ 2017-07-14  1:54             ` Mark Nelson
  2017-07-14  8:29               ` xiaoyan li
  2017-07-14  2:49             ` Ma, Jianpeng
  1 sibling, 1 reply; 14+ messages in thread
From: Mark Nelson @ 2017-07-14  1:54 UTC (permalink / raw)
  To: xiaoyan li, Xiaoxi Chen
  Cc: 攀刘, Sage Weil, Ceph Development, p.zhou, 20702390

Hi Li,

You may want to try my wallclock profiler to see where time is being 
spent during your test.  It is located here:

https://github.com/markhpc/gdbprof

You can run it like:

sudo gdb -ex 'set pagination off' -ex 'attach <pid>' -ex 'source 
/home/ubuntu/src/markhpc/gdbprof/gdbprof.py' -ex 'profile begin' -ex 'quit'

Mark


* RE: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-14  1:47           ` xiaoyan li
  2017-07-14  1:54             ` Mark Nelson
@ 2017-07-14  2:49             ` Ma, Jianpeng
  2017-07-14  8:22               ` xiaoyan li
  1 sibling, 1 reply; 14+ messages in thread
From: Ma, Jianpeng @ 2017-07-14  2:49 UTC (permalink / raw)
  To: xiaoyan li, Xiaoxi Chen; +Cc: 攀刘, Sage Weil, Ceph Development, p.zhou, 20702390

Does "state_kv_commiting_lat - kv_lat" mean the latency of the
"_kv_finalize_thread" thread? Is this correct?

Jianpeng


* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-14  2:49             ` Ma, Jianpeng
@ 2017-07-14  8:22               ` xiaoyan li
  0 siblings, 0 replies; 14+ messages in thread
From: xiaoyan li @ 2017-07-14  8:22 UTC (permalink / raw)
  To: Ma, Jianpeng
  Cc: Xiaoxi Chen, 攀刘, Sage Weil, Ceph Development, p.zhou, 20702390

On Fri, Jul 14, 2017 at 10:49 AM, Ma, Jianpeng <jianpeng.ma@intel.com> wrote:
> Does "state_kv_commiting_lat - kv_lat" mean the latency of the
> "_kv_finalize_thread" thread? Is this correct?
Not exactly. state_kv_commiting_lat is per txc, while kv_lat is per
_kv_sync_thread call, which handles the kv updates of all the txcs in
kv_queue_unsubmitted.
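
Schematically (heavily simplified, from my reading of BlueStore.cc;
helper names like now() and record() are placeholders, not the real
code):

    // _kv_sync_thread, one iteration:
    while (!stop) {
      std::vector<TransContext*> batch;
      batch.swap(kv_queue_unsubmitted);      // take every queued txc
      auto start = now();
      for (TransContext* txc : batch)
        db->submit_transaction(txc->t);      // sync=false, in-memory
      db->submit_transaction_sync(flush_t);  // one WAL flush for the batch
      record(kv_lat, now() - start);         // ONE kv_lat sample per batch
      // Each txc's state_kv_commiting_lat, by contrast, keeps running
      // until that txc's commit callbacks finish: ONE sample per txc,
      // including time spent waiting on the rest of the batch.
    }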



-- 
Best wishes
Lisa


* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-14  1:54             ` Mark Nelson
@ 2017-07-14  8:29               ` xiaoyan li
  2017-07-14  9:31                 ` Xiaoxi Chen
  0 siblings, 1 reply; 14+ messages in thread
From: xiaoyan li @ 2017-07-14  8:29 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Xiaoxi Chen, 攀刘,
	Sage Weil, Ceph Development, p.zhou, 20702390

Here is the output of gdbprof; @Mark, please have a look.
I copied the _kv_sync_thread and _kv_finalize_thread sections here:
http://paste.openstack.org/show/615362/

Lisa


* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-14  8:29               ` xiaoyan li
@ 2017-07-14  9:31                 ` Xiaoxi Chen
  2017-07-14 12:54                   ` 攀刘
  0 siblings, 1 reply; 14+ messages in thread
From: Xiaoxi Chen @ 2017-07-14  9:31 UTC (permalink / raw)
  To: xiaoyan li
  Cc: Mark Nelson, 攀刘,
	Sage Weil, Ceph Development, p.zhou, 20702390

In the state_kv_commit stage, db->submit_transaction() is called, and
all of the rocksdb insert-key logic happens there; as the gdbprof
output shows, this is where the key comparisons and lookups occur. But
db->submit_transaction() sets sync=false, which means the change may
sit only in memory in the RocksDB WAL buffer, not yet persisted to
disk.

The *submit* you refer to just submits an empty transaction with
sync=true, to flush all of the previous WAL persistently to disk.

Clearly kv_commit is CPU-intensive and kv_submit is (sequential)
IO-intensive, so depending on the CPU/disk speed ratio one may see
different profiling results. My previous test on HDD showed the
opposite: kv_lat was pretty long.
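
In rocksdb terms the pattern looks roughly like this (an illustrative
sketch of the two-phase idea, not the actual BlueStore code):

    #include <cassert>
    #include <vector>
    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>

    void submit_then_flush(rocksdb::DB* db,
                           std::vector<rocksdb::WriteBatch*>& batches) {
      // "kv_commit" phase: CPU-heavy memtable inserts (key comparison,
      // lookup); with sync=false nothing is forced to disk yet.
      rocksdb::WriteOptions nosync;
      nosync.sync = false;
      for (rocksdb::WriteBatch* b : batches) {
        rocksdb::Status s = db->Write(nosync, b);
        assert(s.ok());
      }

      // "kv_submit" phase: one empty transaction with sync=true forces
      // the WAL to disk, making all the writes above durable at once.
      rocksdb::WriteOptions sync_opts;
      sync_opts.sync = true;
      rocksdb::WriteBatch empty;
      rocksdb::Status s = db->Write(sync_opts, &empty);
      assert(s.ok());
    }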

2017-07-14 16:29 GMT+08:00 xiaoyan li <wisher2003@gmail.com>:
> Here is the output of gdbprof: @Mark please have a look.
> I copied _kv_sync_thread and _kv_finalize_thread here.
> http://paste.openstack.org/show/615362/
>
> Lisa
>
> On Fri, Jul 14, 2017 at 9:54 AM, Mark Nelson <mnelson@redhat.com> wrote:
>> Hi Li,
>>
>> You may want to try my wallclock profiler to see where time is being spent
>> during your test.  It is located here:
>>
>> https://github.com/markhpc/gdbprof
>>
>> You can run it like:
>>
>> sudo gdb -ex 'set pagination off' -ex 'attach <pid>' -ex 'source
>> /home/ubuntu/src/markhpc/gdbprof/gdbprof.py' -ex 'profile begin' -ex 'quit'
>>
>> Mark
>>
>>
>> On 07/13/2017 08:47 PM, xiaoyan li wrote:
>>>
>>> Hi,
>>> I am concerned about the rocksdb impact on the whole bluestore IO
>>> path, so I did some tests with the bluestore fio plugin.
>>> For example, I got the following data from the log when I ran the
>>> bluestore fio test with numjobs=64 and iodepth=32. It seems that for
>>> every txc, most of the time is spent in the queued and commiting states:
>>> state                   time span (us)
>>> state_prepare_lat       386
>>> state_aio_wait_lat      430
>>> state_io_done_lat       0
>>> state_kv_queued_lat     7926
>>> state_kv_commiting_lat  30653
>>> state_kv_done_lat       4
>>>
>>>         "state_kv_queued_lat": {
>>>             "avgcount": 349076566,
>>>             "sum": 1214245.959889817,
>>>             "avgtime": 0.003478451
>>>         },
>>>         "state_kv_commiting_lat": {
>>>             "avgcount": 174538283,
>>>             "sum": 5612849.022306266,
>>>             "avgtime": 0.032158268
>>>         },
>>>
>>>
>>> At the same time, each sync submit covers about 49 txcs
>>> (174538283/3509556 = 49) and takes only 1024us on average, which is
>>> much less than the commiting_lat of 30653us.
>>>         "kv_lat": {
>>>             "avgcount": 3509556,
>>>             "sum": 3594.365142193,
>>>             "avgtime": 0.001024165
>>>         },
>>>
>>> The time between state_kv_queued_lat and state_kv_commiting_lat:
>>>
>>> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8349
>>>
>>> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L8366
>>>
>>> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L7741
>>>
>>> I am still investigating why so much time is spent in
>>> kv_commiting_lat, but from the above data I suspect it is a problem
>>> in rocksdb.
>>> Please correct me if I misunderstood anything.
>>>
>>> Lisa
>>>
>>>
>>> On Thu, Jul 13, 2017 at 12:34 AM, Xiaoxi Chen <superdebuger@gmail.com>
>>> wrote:
>>>>
>>>>> FWIW, one thing that a KVDB can provide is transaction support, which
>>>>> is important as we need to update several pieces of metadata (Onode,
>>>>> allocator map, and WAL for small writes) transactionally.
>>>>
>>>>
>>>>
>>>> 2017-07-13 0:25 GMT+08:00 攀刘 <liupan1111@gmail.com>:
>>>>>
>>>>> Hi Sage,
>>>>>
>>>>>> Indeed, I have an idea that I have held for a long time.
>>>>>
>>>>>> Do we really need a heavy k/v database to store metadata, especially
>>>>>> for fast disks? Introducing a third-party database also makes
>>>>>> maintenance harder (maybe because of my limited database
>>>>>> knowledge)...
>>>>>
>>>>>> Let's suppose:
>>>>>> 1) The max PG number in one OSD is limited (in my experience, 100~200
>>>>>> PGs per OSD gives the best performance).
>>>>>> 2) The max number of objects in one PG is limited, because of disk
>>>>>> space.
>>>>>
>>>>>> Then, how about this: pre-allocate metadata locations in a metadata
>>>>>> partition.
>>>>>
>>>>>> Partition an SSD into two or three partitions (same as bluestore);
>>>>>> instead of using a kv database, just store metadata directly in one
>>>>>> disk partition (call it the metadata partition). Inside this metadata
>>>>>> partition, we store several data structures:
>>>>>> 1) A hash table of PGs: the key is the PG id, the value is another
>>>>>> hash table (key: the object index within this PG, value: the object
>>>>>> metadata and the object's location in the data partition).
>>>>>> 2) A free object location list.
>>>>>
>>>>> And other extra things...
>>>>>
>>>>>> The max number of PGs belonging to one OSD can be limited by options,
>>>>>> so I believe the metadata partition would not need to be big. We could
>>>>>> load all of the metadata into RAM if RAM is large enough, keep part of
>>>>>> it in an LRU-managed cache, or just read, modify, and write back to
>>>>>> disk when needed.
>>>>>
>>>>>> Do you think this idea is reasonable? At least, I believe this kind of
>>>>>> new storage engine would be much faster.
>>>>>
>>>>> Thanks
>>>>> Pan
>>>>>
>>>>> 2017-07-12 21:55 GMT+08:00 Sage Weil <sage@newdream.net>:
>>>>>>
>>>>>> On Wed, 12 Jul 2017, 攀刘 wrote:
>>>>>>>
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Yes, I totally understand bluestore did much more things than a raw
>>>>>>> disk, but the current overhead is a little too big to our usage. I
>>>>>>> will compare bluestore with XFS(also has metadata tracking,
>>>>>>> allocation, and so on), and to see if XFS also has such impact.
>>>>>>>
>>>>>>> I would like to give a flamegraph later, but from the perfcounter, we
>>>>>>> could find most of time were spent in "kv_lat".
>>>>>>
>>>>>>
>>>>>>> That's rocksdb.  And yeah, I think it's pretty clear that either
>>>>>>> rocksdb needs some serious work to really keep up with nvme (or
>>>>>>> optane) or (more likely) we need an alternate kv backend that is
>>>>>>> targeting high speed flash.  I suspect the latter makes the most
>>>>>>> sense, and I believe there are various efforts at Intel looking at
>>>>>>> alternatives but no winner just yet.
>>>>>>>
>>>>>>> Looking a bit further out, I think a new kv library that natively
>>>>>>> targets persistent memory (e.g., something built on pmem.io) will be
>>>>>>> the right solution.  Although at that point, it's probably a question
>>>>>>> of whether we have pmem for metadata and 3D NAND for data or pure
>>>>>>> pmem; in the latter case a complete replacement for bluestore would
>>>>>>> make more sense.
>>>>>>
>>>>>>> For FTL, yes, it is a good idea, after we get the flame graph, we
>>>>>>> could discuss which part could be improved by FTL, firmware, even open
>>>>>>> channel.
>>>>>>
>>>>>>
>>>>>> Yep!
>>>>>> sage
> --
> Best wishes
> Lisa

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-14  9:31                 ` Xiaoxi Chen
@ 2017-07-14 12:54                   ` 攀刘
  2017-07-14 13:31                     ` Mark Nelson
  0 siblings, 1 reply; 14+ messages in thread
From: 攀刘 @ 2017-07-14 12:54 UTC (permalink / raw)
  To: Xiaoxi Chen
  Cc: xiaoyan li, Mark Nelson, Sage Weil, Ceph Development, p.zhou, 20702390

Hi Sage and Mark,

I ran an experiment locally with fio + libfio_ceph_bluestore.fio
(iodepth = 32, numjobs = 64) and got the results below:

1) Without using gdbprof

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 75526 root      20   0 12.193g 7.056g 271672 R 99.0  5.6   4:20.98
bstore_kv_sync
 75527 root      20   0 12.193g 7.056g 271672 S 61.6  5.6   2:57.03
bstore_kv_final
 75504 root      20   0 12.193g 7.056g 271672 S 35.4  5.6   1:46.34 bstore_aio
 75524 root      20   0 12.193g 7.056g 271672 S 34.1  5.6   1:34.27 finisher
 75567 root      20   0 12.193g 7.056g 271672 S 10.6  5.6   0:26.32 fio

We can see that the bstore_kv_sync thread nearly occupies a full
core (99%).

  write: IOPS=80.1k, BW=1251MiB/s (1312MB/s)(319GiB/260926msec)
    clat (usec): min=1792, max=52694, avg=25505.15, stdev=1731.77
     lat (usec): min=1852, max=52780, avg=25568.08, stdev=1732.09

From the fio output, the avg lat is nearly 25.5 ms.

From the perfcounter in the link:
http://paste.openstack.org/show/615380/

We can see:
kv_commit_lat: 12.1 ms
kv_lat: 12.1 ms
state_kv_queued_lat: 9.98 x 2 = 19.9 ms
state_kv_commiting_lat: 4.67 ms

So we should try to improve state_kv_queued_lat, right?
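
(For reference, the avgtime fields in these dumps are just
sum/avgcount. A tiny C++ check against the numbers Lisa posted earlier
in this thread -- the constants below are copied from her dump, not
from my paste above:)

#include <cstdio>

int main() {
  // "kv_lat" from Lisa's dump: per-sync-batch latency
  double kv_lat_sum = 3594.365142193;      // seconds
  long long kv_lat_n = 3509556;            // number of sync batches
  // "state_kv_commiting_lat" from the same dump: per-txc latency
  double commit_sum = 5612849.022306266;   // seconds
  long long commit_n = 174538283;          // number of txcs
  std::printf("kv_lat avg: %.0f us per sync batch\n",
              1e6 * kv_lat_sum / kv_lat_n);    // ~1024 us
  std::printf("state_kv_commiting avg: %.0f us per txc\n",
              1e6 * commit_sum / commit_n);    // ~32158 us
  std::printf("txcs per sync batch: ~%lld\n",
              commit_n / kv_lat_n);            // ~49
  return 0;
}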

2) Using gdbprof:
http://paste.openstack.org/show/615386/
I pasted the stacks of the KVFinalizeThread and KVSyncThread threads.

I will continue investigating; please let me know if you have any opinions.

Thanks
Pan

2017-07-14 17:31 GMT+08:00 Xiaoxi Chen <superdebuger@gmail.com>:
> In the state_kv_commit stage, db->submit_transaction() is called and
> all of the RocksDB insert-key logic runs there; as the gdbprof output
> shows, that is where the key comparison and lookup happen. But
> db->submit_transaction() sets sync=false, so the change only reaches
> the (potential) RocksDB WAL in memory and is not yet persistent on
> disk.
>
> The *submit* you refer to just submits an empty transaction with
> sync=true, to flush all of the previously buffered WAL persistently to
> disk.
>
> Clearly kv_commit is CPU intensive and kv_submit is (sequential) IO
> intensive, so depending on the CPU/disk speed ratio one may see
> different profiling results. My previous test on HDD showed the
> opposite result: kv_lat was quite long.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: latency compare between 2t NVME SSD P3500 and bluestore
  2017-07-14 12:54                   ` 攀刘
@ 2017-07-14 13:31                     ` Mark Nelson
  0 siblings, 0 replies; 14+ messages in thread
From: Mark Nelson @ 2017-07-14 13:31 UTC (permalink / raw)
  To: 攀刘, Xiaoxi Chen
  Cc: xiaoyan li, Sage Weil, Ceph Development, p.zhou, 20702390

On 07/14/2017 07:54 AM, 攀刘 wrote:
> Hi Sage and Mark,
>
> I ran an experiment locally with fio + libfio_ceph_bluestore.fio
> (iodepth = 32, numjobs = 64) and got the results below:
>
> 1) Without using gdbprof
>
>    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>  75526 root      20   0 12.193g 7.056g 271672 R 99.0  5.6   4:20.98
> bstore_kv_sync
>  75527 root      20   0 12.193g 7.056g 271672 S 61.6  5.6   2:57.03
> bstore_kv_final
>  75504 root      20   0 12.193g 7.056g 271672 S 35.4  5.6   1:46.34 bstore_aio
>  75524 root      20   0 12.193g 7.056g 271672 S 34.1  5.6   1:34.27 finisher
>  75567 root      20   0 12.193g 7.056g 271672 S 10.6  5.6   0:26.32 fio
>
> We can see that the bstore_kv_sync thread nearly occupies a full
> core (99%).
>
>   write: IOPS=80.1k, BW=1251MiB/s (1312MB/s)(319GiB/260926msec)
>     clat (usec): min=1792, max=52694, avg=25505.15, stdev=1731.77
>      lat (usec): min=1852, max=52780, avg=25568.08, stdev=1732.09
>
> From the fio output, the avg lat is nearly 25.5 ms.
>
> From the perfcounter in the link:
> http://paste.openstack.org/show/615380/
>
> We can see:
> kv_commit_lat: 12.1 ms
> kv_lat: 12.1 ms
> state_kv_queued_lat: 9.98 x 2 = 19.9 ms
> state_kv_commiting_lat: 4.67 ms
>
> So we should try to improve state_kv_queued_lat, right?
>
> 2) Using gdbprof:
> http://paste.openstack.org/show/615386/
> I pasted the stacks of the KVFinalizeThread and KVSyncThread threads.

Lots of time is spent bogged down in key comparison operations in rocksdb.
I see that too, but usually in very high throughput scenarios.  How fast
is your CPU?

I'm hoping that the PR from Li will let us make the buffers/sst files
smaller so that we can reduce some of that overhead:

https://github.com/ceph/rocksdb/pull/19
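
To illustrate the kind of knob that touches, here is a hedged sketch of
shrinking the rocksdb buffers/SST files. The option fields are the
standard rocksdb::Options ones, but the values and the test path are
made up for illustration, not settings taken from that PR:

#include <cassert>

#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  // Smaller memtables mean fewer keys to walk per skiplist
  // insert/comparison; smaller SST files mean cheaper compactions.
  opts.write_buffer_size = 16 << 20;        // 16 MiB memtable
  opts.max_write_buffer_number = 4;
  opts.target_file_size_base = 16 << 20;    // smaller L1 SST files
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/kv-tune-test", &db);
  assert(s.ok());
  delete db;
  return 0;
}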

Mark

>
> I will continue investigating; please let me know if you have any opinions.
>
> Thanks
> Pan

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-07-14 13:31 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-12  9:15 latency compare between 2t NVME SSD P3500 and bluestore 攀刘
2017-07-12 12:02 ` Sage Weil
2017-07-12 12:45   ` 攀刘
2017-07-12 13:55     ` Sage Weil
2017-07-12 16:25       ` 攀刘
2017-07-12 16:34         ` Xiaoxi Chen
2017-07-14  1:47           ` xiaoyan li
2017-07-14  1:54             ` Mark Nelson
2017-07-14  8:29               ` xiaoyan li
2017-07-14  9:31                 ` Xiaoxi Chen
2017-07-14 12:54                   ` 攀刘
2017-07-14 13:31                     ` Mark Nelson
2017-07-14  2:49             ` Ma, Jianpeng
2017-07-14  8:22               ` xiaoyan li
