* Re: Blueprint:  Add LevelDB support to ceph cluster backend store
  2013-07-31  3:10 Blueprint: Add LevelDB support to ceph cluster backend store Haomai Wang
@ 2013-07-30 22:54 ` Alex Elsayed
  2013-07-31  5:56   ` Gregory Farnum
  2013-07-31  6:04   ` 袁冬
  2013-07-31  6:01 ` Sage Weil
  1 sibling, 2 replies; 10+ messages in thread
From: Alex Elsayed @ 2013-07-30 22:54 UTC (permalink / raw)
  To: ceph-devel

I posted this as a comment on the blueprint, but I figured I'd say it here:

The thing I'd worry about here is that LevelDB's performance (along with 
that of various other K/V stores) falls off a cliff for large values.

Symas (who make LMDB, used by OpenLDAP) did some benchmarking that shows 
drastic performance loss with 100KB values on both read and write: 
http://symas.com/mdb/microbench/#sec4

It's not just disk latency, either - an SSD showed the same behavior: 
http://symas.com/mdb/microbench/#sec7

I'd recommend REALLY careful benchmarking with a variety of loads (and value 
sizes).
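
A value-size sweep like the one Symas ran can be sketched as a small harness. This is a hypothetical illustration in Python: it uses a plain in-memory dict as a stand-in backend (with a real LevelDB binding such as plyvel, `store_put` would be `db.put` and the large-value cliff becomes visible); the function names and sizes are assumptions, not part of the benchmark linked above.

```python
import os
import time

def bench_store(store_put, sizes=(1024, 4096, 8192, 131072, 1048576), n=64):
    """Time n puts per value size; return {size: MB/s written}."""
    results = {}
    for size in sizes:
        payload = os.urandom(size)
        start = time.perf_counter()
        for i in range(n):
            store_put(b"key-%d-%d" % (size, i), payload)
        elapsed = time.perf_counter() - start
        results[size] = (n * size) / (1024 * 1024) / max(elapsed, 1e-9)
    return results

# Placeholder backend: a plain dict stores references, so it shows no
# cliff; swap in a real K/V binding to reproduce the Symas results.
d = {}
res = bench_store(lambda k, v: d.__setitem__(k, v))
print(sorted(res))
```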


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Blueprint:  Add LevelDB support to ceph cluster backend store
@ 2013-07-31  3:10 Haomai Wang
  2013-07-30 22:54 ` Alex Elsayed
  2013-07-31  6:01 ` Sage Weil
  0 siblings, 2 replies; 10+ messages in thread
From: Haomai Wang @ 2013-07-31  3:10 UTC (permalink / raw)
  To: ceph-devel

Every node of a ceph cluster has a backend filesystem such as btrfs,
xfs, or ext4 that provides storage for data objects, whose locations
are determined by the CRUSH algorithm. There should exist an abstract
interface sitting between the OSD and the backend store, allowing
different backend store implementations. Currently, we only have the
general POSIX interface. LevelDB is a fast key-value storage library
written at Google that provides an ordered mapping from string keys to
string values. We could implement a LevelDB backend to support base
operations corresponding to POSIX operations.  A LevelDB driver enables
the gateway to communicate with LevelDB to store objects on a per-node
basis.


A LevelDB driver is attractive to folks who have a special use case
such as a write-heavy system. If we can abstract a general interface,
we can choose another DBM if we find it more suitable, such as Kyoto
Cabinet or BDB. Furthermore, we can choose a backend store for each OSD
node, so we can have different OSD types for special purposes.

Expected Results: Objects can be stored reliably in LevelDB. The IO
performance and recovery process should be comparable to the original
stores. And for special cases, the LevelDB driver should have much
better performance than the local filesystem backend driver. Snapshots
and any other features you think of are optional.
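
The abstract interface being proposed might look something like the following Python sketch. The class names (`BackendStore`, `MemStore`) and the three-call surface are hypothetical illustrations, not actual Ceph code; a LevelDB-backed class would implement the same calls on top of db.put/db.get/db.delete.

```python
from abc import ABC, abstractmethod

class BackendStore(ABC):
    """Hypothetical abstraction sitting between the OSD and its store."""

    @abstractmethod
    def write(self, oid: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, oid: str) -> bytes: ...

    @abstractmethod
    def remove(self, oid: str) -> None: ...

class MemStore(BackendStore):
    """Toy in-memory backend used here only to exercise the interface."""
    def __init__(self):
        self._objs = {}
    def write(self, oid, data):
        self._objs[oid] = bytes(data)
    def read(self, oid):
        return self._objs[oid]
    def remove(self, oid):
        del self._objs[oid]

store = MemStore()
store.write("obj1", b"hello")
print(store.read("obj1"))  # b'hello'
```

With such an interface in place, swapping LevelDB, Kyoto Cabinet, or BDB underneath becomes a matter of providing another implementation.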

Best regards,
Wheats




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Blueprint: Add LevelDB support to ceph cluster backend store
  2013-07-30 22:54 ` Alex Elsayed
@ 2013-07-31  5:56   ` Gregory Farnum
  2013-07-31  6:04   ` 袁冬
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory Farnum @ 2013-07-31  5:56 UTC (permalink / raw)
  To: Alex Elsayed, haomaiwang; +Cc: ceph-devel

On Tue, Jul 30, 2013 at 3:54 PM, Alex Elsayed <eternaleye@gmail.com> wrote:
> I posted this as a comment on the blueprint, but I figured I'd say it here:
>
> The thing I'd worry about here is that LevelDB's performance (along with
> that of various other K/V stores) falls off a cliff for large values.
>
> Symas (who make LMDB, used by OpenLDAP) did some benchmarking that shows
> drastic performance loss with 100KB values on both read and write:
> http://symas.com/mdb/microbench/#sec4
>
> It's not just disk latency, either - an SSD showed the same behavior:
> http://symas.com/mdb/microbench/#sec7
>
> I'd recommend REALLY careful benchmarking with a variety of loads (and value
> sizes).

There are various users of leveldb who have tuned it more for
workloads like this; Riak has some stuff (not sure how much) and I
believe HyperDex has some code changes that do a bunch, including
better support for large writes.
One thing to keep in mind is that we do already have leveldb in the
OSD; it uses that for "omap" and keeping track of a lot of object
metadata and lookaside stuff. I've asked before about using leveldb as
a backing store and the big trouble with it is that it assumes it's
feasible to copy the values it stores several times; with 4MB objects
it really isn't. That doesn't mean it can't be appropriate for other
kinds of workloads, though, and there are several interface layers for
providing a backing store that could make this pluggable.
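
One common way around the copy cost on large values is to chunk an object's bytestream into many small key/value pairs, so the store never has to move a 4MB value whole. A hedged sketch follows; the key layout, 64KB chunk size, and helper names are illustrative assumptions, not Ceph's actual scheme.

```python
CHUNK = 64 * 1024  # illustrative chunk size, not a tuned value

def put_object(kv: dict, oid: str, data: bytes, chunk=CHUNK):
    """Split an object's bytes into fixed-size chunks keyed by (oid, index)."""
    n = 0
    for off in range(0, len(data), chunk):
        kv["%s/%08d" % (oid, n)] = data[off:off + chunk]
        n += 1
    kv["%s/len" % oid] = str(len(data)).encode()

def get_object(kv: dict, oid: str, chunk=CHUNK):
    """Reassemble the object from its chunk keys."""
    total = int(kv["%s/len" % oid])
    parts = []
    for n in range((total + chunk - 1) // chunk):
        parts.append(kv["%s/%08d" % (oid, n)])
    return b"".join(parts)

kv = {}
blob = bytes(range(256)) * 16384  # a 4 MiB object
put_object(kv, "rbd_data.1", blob)
assert get_object(kv, "rbd_data.1") == blob
```

A dict stands in for the K/V store here; overwrites and partial reads then touch only the affected chunks rather than the whole 4MB value.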
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Blueprint:  Add LevelDB support to ceph cluster backend store
  2013-07-31  3:10 Blueprint: Add LevelDB support to ceph cluster backend store Haomai Wang
  2013-07-30 22:54 ` Alex Elsayed
@ 2013-07-31  6:01 ` Sage Weil
  2013-07-31  6:38   ` Haomai Wang
  1 sibling, 1 reply; 10+ messages in thread
From: Sage Weil @ 2013-07-31  6:01 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

Hi Haomai,

On Wed, 31 Jul 2013, Haomai Wang wrote:
> Every node of a ceph cluster has a backend filesystem such as btrfs,
> xfs, or ext4 that provides storage for data objects, whose locations
> are determined by the CRUSH algorithm. There should exist an abstract
> interface sitting between the OSD and the backend store, allowing
> different backend store implementations. Currently, we only have the
> general POSIX interface. LevelDB is a fast key-value storage library
> written at Google that provides an ordered mapping from string keys to
> string values. We could implement a LevelDB backend to support base
> operations corresponding to POSIX operations.  A LevelDB driver enables
> the gateway to communicate with LevelDB to store objects on a per-node
> basis.
> 
> 
> A LevelDB driver is attractive to folks who have a special use case
> such as a write-heavy system. If we can abstract a general interface,
> we can choose another DBM if we find it more suitable, such as Kyoto
> Cabinet or BDB. Furthermore, we can choose a backend store for each OSD
> node, so we can have different OSD types for special purposes.
> 
> Expected Results: Objects can be stored reliably in LevelDB. The IO
> performance and recovery process should be comparable to the original
> stores. And for special cases, the LevelDB driver should have much
> better performance than the local filesystem backend driver. Snapshots
> and any other features you think of are optional.

I added a comment in the wiki, but I'll reply here.

Much of what you're talking about is already in place:

 - There is an ObjectStore.h abstraction of the local storage.  The only 
   up-to-date implementation is FileStore, which uses a combination 
   of a local file system and leveldb, but other backends have been used 
   in the past, and new ones can be easily added.

 - We currently use leveldb for the 'omap' component of rados objects.  
   That is, each rados object has a bytestream portion (like a file), 
   attrs (like extended attributes), and an omap (keys/values).  All or 
   none of those interfaces can be used for any given object, although 
   most users only use one interface at a time.  The main limitation here, 
   if you want to use leveldb only, is that we still have an inode in the 
   file system to represent each object, even when it contains only 
   key/value pairs.

 - The use of leveldb itself is also well abstracted by a KeyValueDB 
   interface, so other key/value libraries could be swapped in in its 
   place.  The main other component is a middle layer that wraps the kv 
   store to provide copy-on-write type semantics for each object's set of 
   keys (to facilitate the snapshot functionality in rados/ceph).
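
The copy-on-write middle layer described above can be sketched roughly as follows: a clone shares its parent's keys and stores only what it overwrites locally. This is an assumption-laden Python simplification (the class name and structure are hypothetical; Ceph's actual object-map code differs), meant only to show the semantics.

```python
class COWMap:
    """Sketch of copy-on-write key/value semantics for one object's keys:
    a clone reads through to its parent and writes only locally."""
    def __init__(self, parent=None):
        self._local = {}
        self._deleted = set()
        self._parent = parent

    def set(self, k, v):
        self._local[k] = v
        self._deleted.discard(k)

    def get(self, k):
        if k in self._deleted:
            raise KeyError(k)
        if k in self._local:
            return self._local[k]
        if self._parent is not None:
            return self._parent.get(k)
        raise KeyError(k)

    def delete(self, k):
        self._deleted.add(k)
        self._local.pop(k, None)

    def clone(self):
        """Snapshot: the clone shares all keys without copying any."""
        return COWMap(parent=self)

head = COWMap()
head.set("a", b"1")
snap = head.clone()   # shares 'a' with head, zero copies
snap.set("a", b"2")   # diverges locally; head's view is unchanged
print(head.get("a"), snap.get("a"))  # b'1' b'2'
```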

If you have a workload that you want to be purely key/value based, it 
would be possible to write a much simpler ObjectStore implementation that 
ignores or trivially implements the byte and attr portions of the object 
in leveldb (or the KeyValueDB abstraction).  It would have very different 
performance characteristics than what we're doing now, of course.  You 
might also be interested in looking at the HyperLevelDB project, which is 
a fork of leveldb that focuses on multithreading and compaction 
performance.

We've heard from other people who are interested in wiring different 
key/value backends into the OSD, so any work to make it easier to do that 
would be great!

sage

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Blueprint: Add LevelDB support to ceph cluster backend store
  2013-07-30 22:54 ` Alex Elsayed
  2013-07-31  5:56   ` Gregory Farnum
@ 2013-07-31  6:04   ` 袁冬
  2013-07-31  6:07     ` 袁冬
  1 sibling, 1 reply; 10+ messages in thread
From: 袁冬 @ 2013-07-31  6:04 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: ceph-devel

We have the same idea and have already tested LevelDB performance vs.
Btrfs.  The results are negative, especially for big-block IO.

                          1KB Block   4KB Block   8KB Block   128KB Block   1MB Block
LevelDB with Compress:    1.77MB/s    5.15MB/s    6.44MB/s    7.64MB/s      13.61MB/s
LevelDB without Compress: 1.12MB/s    3.21MB/s    4.57MB/s    7.28MB/s      13.28MB/s
Btrfs:                    13.84MB/s   12.96MB/s   18.29MB/s   95.26MB/s     109.23MB/s

On 31 July 2013 06:54, Alex Elsayed <eternaleye@gmail.com> wrote:
> I posted this as a comment on the blueprint, but I figured I'd say it here:
>
> The thing I'd worry about here is that LevelDB's performance (along with
> that of various other K/V stores) falls off a cliff for large values.
>
> Symas (who make LMDB, used by OpenLDAP) did some benchmarking that shows
> drastic performance loss with 100KB values on both read and write:
> http://symas.com/mdb/microbench/#sec4
>
> It's not just disk latency, either - an SSD showed the same behavior:
> http://symas.com/mdb/microbench/#sec7
>
> I'd recommend REALLY careful benchmarking with a variety of loads (and value
> sizes).
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Dong Yuan
Email:yuandong1222@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Blueprint: Add LevelDB support to ceph cluster backend store
  2013-07-31  6:04   ` 袁冬
@ 2013-07-31  6:07     ` 袁冬
  0 siblings, 0 replies; 10+ messages in thread
From: 袁冬 @ 2013-07-31  6:07 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: ceph-devel

A better format result:

1KB Block
LevelDB with Compress: 1.77MB/s
LevelDB without Compress: 1.12MB/s
Btrfs: 13.84MB/s

4KB Block
LevelDB with Compress: 5.15MB/s
LevelDB without Compress: 3.21MB/s
Btrfs: 12.96MB/s

8KB Block
LevelDB with Compress:  6.44MB/s
LevelDB without Compress: 4.57MB/s
Btrfs: 18.29MB/s

128KB Block
LevelDB with Compress:  7.64MB/s
LevelDB without Compress: 7.28MB/s
Btrfs:  95.26MB/s

1MB Block
LevelDB with Compress:  13.61MB/s
LevelDB without Compress: 13.28MB/s
Btrfs:  109.23MB/s

On 31 July 2013 14:04, 袁冬 <yuandong1222@gmail.com> wrote:
> We have the same idea and have already tested LevelDB performance vs.
> Btrfs.  The results are negative, especially for big-block IO.
>
>                           1KB Block   4KB Block   8KB Block   128KB Block   1MB Block
> LevelDB with Compress:    1.77MB/s    5.15MB/s    6.44MB/s    7.64MB/s      13.61MB/s
> LevelDB without Compress: 1.12MB/s    3.21MB/s    4.57MB/s    7.28MB/s      13.28MB/s
> Btrfs:                    13.84MB/s   12.96MB/s   18.29MB/s   95.26MB/s     109.23MB/s
>
> On 31 July 2013 06:54, Alex Elsayed <eternaleye@gmail.com> wrote:
>> I posted this as a comment on the blueprint, but I figured I'd say it here:
>>
>> The thing I'd worry about here is that LevelDB's performance (along with
>> that of various other K/V stores) falls off a cliff for large values.
>>
>> Symas (who make LMDB, used by OpenLDAP) did some benchmarking that shows
>> drastic performance loss with 100KB values on both read and write:
>> http://symas.com/mdb/microbench/#sec4
>>
>> It's not just disk latency, either - an SSD showed the same behavior:
>> http://symas.com/mdb/microbench/#sec7
>>
>> I'd recommend REALLY careful benchmarking with a variety of loads (and value
>> sizes).
>>
>
>
>
> --
> Dong Yuan
> Email:yuandong1222@gmail.com



-- 
Dong Yuan
Email:yuandong1222@gmail.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Blueprint:  Add LevelDB support to ceph cluster backend store
  2013-07-31  6:01 ` Sage Weil
@ 2013-07-31  6:38   ` Haomai Wang
  2013-08-27 23:01     ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Haomai Wang @ 2013-07-31  6:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel


2013-7-31, 2:01, Sage Weil <sage@inktank.com> wrote:

> Hi Haomai,
> 
> On Wed, 31 Jul 2013, Haomai Wang wrote:
>> Every node of a ceph cluster has a backend filesystem such as btrfs,
>> xfs, or ext4 that provides storage for data objects, whose locations
>> are determined by the CRUSH algorithm. There should exist an abstract
>> interface sitting between the OSD and the backend store, allowing
>> different backend store implementations. Currently, we only have the
>> general POSIX interface. LevelDB is a fast key-value storage library
>> written at Google that provides an ordered mapping from string keys to
>> string values. We could implement a LevelDB backend to support base
>> operations corresponding to POSIX operations.  A LevelDB driver enables
>> the gateway to communicate with LevelDB to store objects on a per-node
>> basis.
>> 
>> 
>> A LevelDB driver is attractive to folks who have a special use case
>> such as a write-heavy system. If we can abstract a general interface,
>> we can choose another DBM if we find it more suitable, such as Kyoto
>> Cabinet or BDB. Furthermore, we can choose a backend store for each OSD
>> node, so we can have different OSD types for special purposes.
>> 
>> Expected Results: Objects can be stored reliably in LevelDB. The IO
>> performance and recovery process should be comparable to the original
>> stores. And for special cases, the LevelDB driver should have much
>> better performance than the local filesystem backend driver. Snapshots
>> and any other features you think of are optional.
> 
> I added a comment in the wiki, but I'll reply here.
> 
> Much of what you're talking about is already in place:
> 
> - There is an ObjectStore.h abstraction of the local storage.  The only 
>   up-to-date implementation is FileStore, which uses a combination 
>   of a local file system and leveldb, but other backends have been used 
>   in the past, and new ones can be easily added.
> 
> - We currently use leveldb for the 'omap' component of rados objects.  
>   That is, each rados object has a bytestream portion (like a file), 
>   attrs (like extended attributes), and an omap (keys/values).  All or 
>   none of those interfaces can be used for any given object, although 
>   most users only use one interface at a time.  The main limitation here, 
>   if you want to use leveldb only, is that we still have an inode in the 
>   file system to represent each object, even when it contains only 
>   key/value pairs.
> 
> - The use of leveldb itself is also well abstracted by a KeyValueDB 
>   interface, so other key/value libraries could be swapped in in its 
>   place.  The main other component is a middle layer that wraps the kv 
>   store to provide copy-on-write type semantics for each object's set of 
>   keys (to facilitate the snapshot functionality in rados/ceph).
> 
> If you have a workload that you want to be purely key/value based, it 
> would be possible to write a much simpler ObjectStore implementation that 
> ignores or trivially implements the byte and attr portions of the object 
> in leveldb (or the KeyValueDB abstraction).  It would have very different 
> performance characteristics than what we're doing now, of course.  You 
> might also be interested in looking at the HyperLevelDB project, which is 
> a fork of leveldb that focuses on multithreading and compaction 
> performance.
I'm happy to hear it. 

I think there may be one thing you have left out.  If we abstract a unified
interface (or several different interfaces), we can allow different pools to
be used in different situations.  For example, two LevelDB-backed OSD nodes
could form a distributed k/v store, while three Btrfs OSD nodes form a
traditional use case.  This gives users more room for imagination.
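
A per-OSD backend choice could be prototyped as a simple registry mapping a configured backend name to a store factory. Everything here is a hypothetical sketch: the names, the config shape, and the dict stand-in are assumptions, not Ceph configuration.

```python
# Hypothetical registry: configured backend name -> store factory.
BACKENDS = {
    "memstore": dict,  # toy in-memory stand-in for illustration
    # a real entry might construct a LevelDB or filesystem-backed store
}

def make_store(name: str):
    """Instantiate the backend configured for one OSD."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError("unknown backend store: %r" % name)

# Each OSD picks its own backend; pools could then target OSD types.
osd_conf = {"osd.0": "memstore", "osd.1": "memstore"}
stores = {osd: make_store(kind) for osd, kind in osd_conf.items()}
print(sorted(stores))
```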
> 
> We've heard from other people who are interested in wiring different 
> key/value backends into the OSD, so any work to make it easier to do that 
> would be great!
> 
> sage

Best regards,
Wheats




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Blueprint:  Add LevelDB support to ceph cluster backend store
  2013-07-31  6:38   ` Haomai Wang
@ 2013-08-27 23:01     ` Sage Weil
  2013-08-28 14:12       ` Haomai Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2013-08-27 23:01 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

Hi Haomai,

I just wanted to check in to see if things have progressed at all since we 
talked at CDS.  If you have any questions or there is anything I can 
help with, let me know!  I'd love to see this alternative backend make it 
into Emperor.

Thanks!
sage


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Blueprint:  Add LevelDB support to ceph cluster backend store
  2013-08-27 23:01     ` Sage Weil
@ 2013-08-28 14:12       ` Haomai Wang
  2013-08-28 16:17         ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Haomai Wang @ 2013-08-28 14:12 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel


On Aug 28, 2013, at 7:01 AM, Sage Weil <sage@inktank.com> wrote:

> Hi Haomai,
> 
> I just wanted to check in to see if things have progressed at all since we 
> talked at CDS.  If you have any questions or there is anything I can 
> help with, let me know!  I'd love to see this alternative backend make it 
> into Emperor.
Yes, I'm ready to do it. May I ask how to register the blueprint in redmine?
Is it right to create it directly? Is there an example blueprint I can follow?
> 
> Thanks!
> sage
> 

Best regards,
Wheats




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Blueprint:  Add LevelDB support to ceph cluster backend store
  2013-08-28 14:12       ` Haomai Wang
@ 2013-08-28 16:17         ` Sage Weil
  0 siblings, 0 replies; 10+ messages in thread
From: Sage Weil @ 2013-08-28 16:17 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

On Wed, 28 Aug 2013, Haomai Wang wrote:
> 
> On Aug 28, 2013, at 7:01 AM, Sage Weil <sage@inktank.com> wrote:
> 
> > Hi Haomai,
> > 
> > I just wanted to check in to see if things have progressed at all since we 
> > talked at CDS.  If you have any questions or there is anything I can 
> > help with, let me know!  I'd love to see this alternative backend make it 
> > into Emperor.
> Yes, I'm ready to do it. May I ask about how to register bp to redmine? Is it
> true to do it directly? Can I follow a example bp?

There is no magic connection between the blueprints and redmine (yet).  
Just create a redmine account (if you haven't already) and open a Feature 
ticket, and cut&paste or link back to the blueprint.  (I've added you to 
the developer group, which lets you do a number of things that you couldn't 
before.)

sage

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-08-28 16:17 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-31  3:10 Blueprint: Add LevelDB support to ceph cluster backend store Haomai Wang
2013-07-30 22:54 ` Alex Elsayed
2013-07-31  5:56   ` Gregory Farnum
2013-07-31  6:04   ` 袁冬
2013-07-31  6:07     ` 袁冬
2013-07-31  6:01 ` Sage Weil
2013-07-31  6:38   ` Haomai Wang
2013-08-27 23:01     ` Sage Weil
2013-08-28 14:12       ` Haomai Wang
2013-08-28 16:17         ` Sage Weil
