* Initial newstore vs filestore results
@ 2015-04-07 14:57 Mark Nelson
  2015-04-07 19:16 ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-07 14:57 UTC (permalink / raw)
  To: ceph-devel


Hi Guys,

I ran some quick tests on Sage's newstore branch.  So far given that 
this is a prototype, things are looking pretty good imho.  The 4MB 
object rados bench read/write and small read performance looks 
especially good.  Keep in mind that this is not using the SSD journals 
in any way, so 640MB/s sequential writes is actually really good 
compared to filestore without SSD journals.

Small write performance appears to be fairly bad, especially in the RBD 
case where it's small writes to larger objects.  I'm going to sit down 
and see if I can figure out what's going on.  It's bad enough that I 
suspect there's just something odd going on.

Mark

[-- Attachment #2: newstore_vs_filestore.ods --]
[-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 50545 bytes --]


* Re: Initial newstore vs filestore results
  2015-04-07 14:57 Initial newstore vs filestore results Mark Nelson
@ 2015-04-07 19:16 ` Mark Nelson
  2015-04-08  1:45   ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-07 19:16 UTC (permalink / raw)
  To: ceph-devel

On 04/07/2015 09:57 AM, Mark Nelson wrote:
> Hi Guys,
>
> I ran some quick tests on Sage's newstore branch.  So far given that
> this is a prototype, things are looking pretty good imho.  The 4MB
> object rados bench read/write and small read performance looks
> especially good.  Keep in mind that this is not using the SSD journals
> in any way, so 640MB/s sequential writes is actually really good
> compared to filestore without SSD journals.
>
> small write performance appears to be fairly bad, especially in the RBD
> case where it's small writes to larger objects.  I'm going to sit down
> and see if I can figure out what's going on.  It's bad enough that I
> suspect there's just something odd going on.
>
> Mark

Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those 
interested:

http://nhm.ceph.com/newstore/

Interestingly small object write/read performance with 4 OSDs was about 
1/3-1/4 the speed of the same cluster with 36 OSDs.

Note: Thanks Dan for fixing the directory column width!

Mark


* Re: Initial newstore vs filestore results
  2015-04-07 19:16 ` Mark Nelson
@ 2015-04-08  1:45   ` Mark Nelson
  2015-04-08  1:48     ` Somnath Roy
  2015-04-08  2:58     ` Sage Weil
  0 siblings, 2 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-08  1:45 UTC (permalink / raw)
  To: ceph-devel



On 04/07/2015 02:16 PM, Mark Nelson wrote:
> On 04/07/2015 09:57 AM, Mark Nelson wrote:
>> Hi Guys,
>>
>> I ran some quick tests on Sage's newstore branch.  So far given that
>> this is a prototype, things are looking pretty good imho.  The 4MB
>> object rados bench read/write and small read performance looks
>> especially good.  Keep in mind that this is not using the SSD journals
>> in any way, so 640MB/s sequential writes is actually really good
>> compared to filestore without SSD journals.
>>
>> small write performance appears to be fairly bad, especially in the RBD
>> case where it's small writes to larger objects.  I'm going to sit down
>> and see if I can figure out what's going on.  It's bad enough that I
>> suspect there's just something odd going on.
>>
>> Mark
>
> Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
> interested:
>
> http://nhm.ceph.com/newstore/
>
> Interestingly small object write/read performance with 4 OSDs was about
> 1/3-1/4 the speed of the same cluster with 36 OSDs.
>
> Note: Thanks Dan for fixing the directory column width!
>
> Mark

New fio/librbd results using Sage's latest code that attempts to keep 
small overwrite extents in the db.  This is a 4 OSD setup, so it's not 
directly comparable to the 36 OSD tests above, but it does include 
seekwatcher graphs.  Results in MB/s:

	write	read	randw	randr
4MB	57.9	319.6	55.2	285.9
128KB	2.5	230.6	2.4	125.4
4KB	0.46	55.65	1.11	3.56

Seekwatcher graphs:

http://nhm.ceph.com/newstore/20150407/

Mark


* RE: Initial newstore vs filestore results
  2015-04-08  1:45   ` Mark Nelson
@ 2015-04-08  1:48     ` Somnath Roy
  2015-04-08  1:53       ` Mark Nelson
  2015-04-08  2:58     ` Sage Weil
  1 sibling, 1 reply; 28+ messages in thread
From: Somnath Roy @ 2015-04-08  1:48 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel

Mark,
Could you please send out instructions on how to use this new store?

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, April 07, 2015 6:46 PM
To: ceph-devel
Subject: Re: Initial newstore vs filestore results



On 04/07/2015 02:16 PM, Mark Nelson wrote:
> On 04/07/2015 09:57 AM, Mark Nelson wrote:
>> Hi Guys,
>>
>> I ran some quick tests on Sage's newstore branch.  So far given that
>> this is a prototype, things are looking pretty good imho.  The 4MB
>> object rados bench read/write and small read performance looks
>> especially good.  Keep in mind that this is not using the SSD
>> journals in any way, so 640MB/s sequential writes is actually really
>> good compared to filestore without SSD journals.
>>
>> small write performance appears to be fairly bad, especially in the
>> RBD case where it's small writes to larger objects.  I'm going to sit
>> down and see if I can figure out what's going on.  It's bad enough
>> that I suspect there's just something odd going on.
>>
>> Mark
>
> Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for
> those
> interested:
>
> http://nhm.ceph.com/newstore/
>
> Interestingly small object write/read performance with 4 OSDs was
> about
> 1/3-1/4 the speed of the same cluster with 36 OSDs.
>
> Note: Thanks Dan for fixing the directory column width!
>
> Mark

New fio/librbd results using Sage's latest code that attempts to keep small overwrite extents in the db.  This is 4 OSD so not directly comparable to the 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:

        write   read    randw   randr
4MB     57.9    319.6   55.2    285.9
128KB   2.5     230.6   2.4     125.4
4KB     0.46    55.65   1.11    3.56

Seekwatcher graphs:

http://nhm.ceph.com/newstore/20150407/

Mark


* Re: Initial newstore vs filestore results
  2015-04-08  1:48     ` Somnath Roy
@ 2015-04-08  1:53       ` Mark Nelson
  2015-04-08  2:26         ` Chen, Xiaoxi
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-08  1:53 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Hi Somnath,

Sure.  It's very easy:

1) install or build wip-newstore
2) Add the following to your ceph.conf file:

enable experimental unrecoverable data corrupting features = newstore rocksdb
osd objectstore = newstore
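
For reference, these can sit under [osd] (or [global]) in ceph.conf, 
roughly like this (the section placement here is just a sketch, not a 
requirement):

    [osd]
        enable experimental unrecoverable data corrupting features = newstore rocksdb
        osd objectstore = newstore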

Lots of interesting things to dig into!

Mark

On 04/07/2015 08:48 PM, Somnath Roy wrote:
> Mark,
> Could you please send the instruction out on how to use this new store?
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, April 07, 2015 6:46 PM
> To: ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
>
>
> On 04/07/2015 02:16 PM, Mark Nelson wrote:
>> On 04/07/2015 09:57 AM, Mark Nelson wrote:
>>> Hi Guys,
>>>
>>> I ran some quick tests on Sage's newstore branch.  So far given that
>>> this is a prototype, things are looking pretty good imho.  The 4MB
>>> object rados bench read/write and small read performance looks
>>> especially good.  Keep in mind that this is not using the SSD
>>> journals in any way, so 640MB/s sequential writes is actually really
>>> good compared to filestore without SSD journals.
>>>
>>> small write performance appears to be fairly bad, especially in the
>>> RBD case where it's small writes to larger objects.  I'm going to sit
>>> down and see if I can figure out what's going on.  It's bad enough
>>> that I suspect there's just something odd going on.
>>>
>>> Mark
>>
>> Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for
>> those
>> interested:
>>
>> http://nhm.ceph.com/newstore/
>>
>> Interestingly small object write/read performance with 4 OSDs was
>> about
>> 1/3-1/4 the speed of the same cluster with 36 OSDs.
>>
>> Note: Thanks Dan for fixing the directory column width!
>>
>> Mark
>
> New fio/librbd results using Sage's latest code that attempts to keep small overwrite extents in the db.  This is 4 OSD so not directly comparable to the 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
>
>          write   read    randw   randr
> 4MB     57.9    319.6   55.2    285.9
> 128KB   2.5     230.6   2.4     125.4
> 4KB     0.46    55.65   1.11    3.56
>
> Seekwatcher graphs:
>
> http://nhm.ceph.com/newstore/20150407/
>
> Mark


* RE: Initial newstore vs filestore results
  2015-04-08  1:53       ` Mark Nelson
@ 2015-04-08  2:26         ` Chen, Xiaoxi
  0 siblings, 0 replies; 28+ messages in thread
From: Chen, Xiaoxi @ 2015-04-08  2:26 UTC (permalink / raw)
  To: Mark Nelson, Somnath Roy, ceph-devel

Hi Mark,

Really, thanks for the data.

Not sure if this PR will be merged soon (https://github.com/ceph/ceph/pull/4266).

Some known bugs around:
       `rados ls` will cause an assert failure (which is fixed by the PR).
       `rbd list` will also cause an assert failure (because omap_iter hasn't been implemented yet).
       PGs will not go back to active+clean after an OSD restart. (WIP)

The most performance-related parts seem to be newstore_fsync_threads and rocksdb tuning:
       The small rados write performance degradation should be related to fsync, since it is in creation mode and will not go through rocksdb.
       The RBD random write case should be related to rocksdb, given the compaction overhead (and also the WAL log there).
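
If anyone wants to experiment with the fsync-thread knob, it would go 
into ceph.conf roughly like this (the value is only a placeholder, not 
a tested recommendation; I'm not sure of the exact rocksdb option name, 
so I left that side out):

    [osd]
        newstore fsync threads = 16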

                                                                                                    Xiaoxi
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Wednesday, April 8, 2015 9:53 AM
To: Somnath Roy; ceph-devel
Subject: Re: Initial newstore vs filestore results

Hi Somnath,

Sure.  It's very easy:

1) install or build wip-newstore
2) Add the following to your ceph.conf file:

enable experimental unrecoverable data corrupting features = newstore rocksdb osd objectstore = newstore

Lots of interesting things to dig into!

Mark

On 04/07/2015 08:48 PM, Somnath Roy wrote:
> Mark,
> Could you please send the instruction out on how to use this new store?
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, April 07, 2015 6:46 PM
> To: ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
>
>
> On 04/07/2015 02:16 PM, Mark Nelson wrote:
>> On 04/07/2015 09:57 AM, Mark Nelson wrote:
>>> Hi Guys,
>>>
>>> I ran some quick tests on Sage's newstore branch.  So far given that 
>>> this is a prototype, things are looking pretty good imho.  The 4MB 
>>> object rados bench read/write and small read performance looks 
>>> especially good.  Keep in mind that this is not using the SSD 
>>> journals in any way, so 640MB/s sequential writes is actually really 
>>> good compared to filestore without SSD journals.
>>>
>>> small write performance appears to be fairly bad, especially in the 
>>> RBD case where it's small writes to larger objects.  I'm going to 
>>> sit down and see if I can figure out what's going on.  It's bad 
>>> enough that I suspect there's just something odd going on.
>>>
>>> Mark
>>
>> Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for 
>> those
>> interested:
>>
>> http://nhm.ceph.com/newstore/
>>
>> Interestingly small object write/read performance with 4 OSDs was 
>> about
>> 1/3-1/4 the speed of the same cluster with 36 OSDs.
>>
>> Note: Thanks Dan for fixing the directory column width!
>>
>> Mark
>
> New fio/librbd results using Sage's latest code that attempts to keep small overwrite extents in the db.  This is 4 OSD so not directly comparable to the 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
>
>          write   read    randw   randr
> 4MB     57.9    319.6   55.2    285.9
> 128KB   2.5     230.6   2.4     125.4
> 4KB     0.46    55.65   1.11    3.56
>
> Seekwatcher graphs:
>
> http://nhm.ceph.com/newstore/20150407/
>
> Mark


* Re: Initial newstore vs filestore results
  2015-04-08  1:45   ` Mark Nelson
  2015-04-08  1:48     ` Somnath Roy
@ 2015-04-08  2:58     ` Sage Weil
  2015-04-08  7:24       ` Haomai Wang
                         ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Sage Weil @ 2015-04-08  2:58 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

On Tue, 7 Apr 2015, Mark Nelson wrote:
> On 04/07/2015 02:16 PM, Mark Nelson wrote:
> > On 04/07/2015 09:57 AM, Mark Nelson wrote:
> > > Hi Guys,
> > > 
> > > I ran some quick tests on Sage's newstore branch.  So far given that
> > > this is a prototype, things are looking pretty good imho.  The 4MB
> > > object rados bench read/write and small read performance looks
> > > especially good.  Keep in mind that this is not using the SSD journals
> > > in any way, so 640MB/s sequential writes is actually really good
> > > compared to filestore without SSD journals.
> > > 
> > > small write performance appears to be fairly bad, especially in the RBD
> > > case where it's small writes to larger objects.  I'm going to sit down
> > > and see if I can figure out what's going on.  It's bad enough that I
> > > suspect there's just something odd going on.
> > > 
> > > Mark
> > 
> > Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
> > interested:
> > 
> > http://nhm.ceph.com/newstore/
> > 
> > Interestingly small object write/read performance with 4 OSDs was about
> > 1/3-1/4 the speed of the same cluster with 36 OSDs.
> > 
> > Note: Thanks Dan for fixing the directory column width!
> > 
> > Mark
> 
> New fio/librbd results using Sage's latest code that attempts to keep small
> overwrite extents in the db.  This is 4 OSD so not directly comparable to the
> 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
> 
> 	write	read	randw	randr
> 4MB	57.9	319.6	55.2	285.9
> 128KB	2.5	230.6	2.4	125.4
> 4KB	0.46	55.65	1.11	3.56

What would be very interesting would be to see the 4KB performance 
with the defaults (newstore overlay max = 32) vs overlays disabled 
(newstore overlay max = 0) and see if/how much it is helping.

The latest branch also has open-by-handle.  It's on by default (newstore 
open by handle = true).  I think for most workloads it won't be very 
noticeable... I think there are two questions we need to answer though:

1) Does it have any impact on a creation workload (say, 4kb objects).  It 
shouldn't, but we should confirm.

2) Does it impact small object random reads with a cold cache.  I think to 
see the effect we'll probably need to pile a ton of objects into the 
store, drop caches, and then do random reads.  In the best case the 
effect will be small, but hopefully noticeable: we should go from 
a directory lookup (1+ seeks) + inode lookup (1+ seek) + data 
read, to inode lookup (1+ seek) + data read.  So, 3 -> 2 seeks best case?  
I'm not really sure what XFS is doing under the covers here...
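
Something along these lines is what I have in mind for (2); the pool 
name, run times, and object size below are placeholders, not a spec:

    #!/usr/bin/env python
    # Sketch of the cold-cache random read test described in (2) above.
    import subprocess

    def run(cmd):
        print("+ " + cmd)
        subprocess.check_call(cmd, shell=True)

    # 1) Pile a ton of small (4KB) objects into the store.
    run("rados bench -p rbd 300 write -b 4096 --no-cleanup")

    # 2) Drop caches (on the OSD hosts) so reads pay the full seek cost.
    run("sync")
    run("echo 3 | sudo tee /proc/sys/vm/drop_caches")

    # 3) Cold-cache random reads against the objects we just wrote.
    run("rados bench -p rbd 60 rand")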

sage


* Re: Initial newstore vs filestore results
  2015-04-08  2:58     ` Sage Weil
@ 2015-04-08  7:24       ` Haomai Wang
  2015-04-08 16:49         ` Sage Weil
  2015-04-08 14:38       ` Mark Nelson
  2015-04-09  3:19       ` Mark Nelson
  2 siblings, 1 reply; 28+ messages in thread
From: Haomai Wang @ 2015-04-08  7:24 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, ceph-devel

On Wed, Apr 8, 2015 at 10:58 AM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 7 Apr 2015, Mark Nelson wrote:
>> On 04/07/2015 02:16 PM, Mark Nelson wrote:
>> > On 04/07/2015 09:57 AM, Mark Nelson wrote:
>> > > Hi Guys,
>> > >
>> > > I ran some quick tests on Sage's newstore branch.  So far given that
>> > > this is a prototype, things are looking pretty good imho.  The 4MB
>> > > object rados bench read/write and small read performance looks
>> > > especially good.  Keep in mind that this is not using the SSD journals
>> > > in any way, so 640MB/s sequential writes is actually really good
>> > > compared to filestore without SSD journals.
>> > >
>> > > small write performance appears to be fairly bad, especially in the RBD
>> > > case where it's small writes to larger objects.  I'm going to sit down
>> > > and see if I can figure out what's going on.  It's bad enough that I
>> > > suspect there's just something odd going on.
>> > >
>> > > Mark
>> >
>> > Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
>> > interested:
>> >
>> > http://nhm.ceph.com/newstore/
>> >
>> > Interestingly small object write/read performance with 4 OSDs was about
>> > 1/3-1/4 the speed of the same cluster with 36 OSDs.
>> >
>> > Note: Thanks Dan for fixing the directory column width!
>> >
>> > Mark
>>
>> New fio/librbd results using Sage's latest code that attempts to keep small
>> overwrite extents in the db.  This is 4 OSD so not directly comparable to the
>> 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
>>
>>       write   read    randw   randr
>> 4MB   57.9    319.6   55.2    285.9
>> 128KB 2.5     230.6   2.4     125.4
>> 4KB   0.46    55.65   1.11    3.56
>
> What would be very interesting would be to see the 4KB performance
> with the defaults (newstore overlay max = 32) vs overlays disabled
> (newstore overlay max = 0) and see if/how much it is helping.
>
> The latest branch also has open-by-handle.  It's on by default (newstore
> open by handle = true).  I think for most workloads it won't be very
> noticeable... I think there are two questions we need to answer though:
>
> 1) Does it have any impact on a creation workload (say, 4kb objects).  It
> shouldn't, but we should confirm.
>
> 2) Does it impact small object random reads with a cold cache.  I think to
> see the effect we'll probably need to pile a ton of objects into the
> store, drop caches, and then do random reads.  In the best case the
> effect will be small, but hopefully noticeable: we should go from
> a directory lookup (1+ seeks) + inode lookup (1+ seek) + data
> read, to inode lookup (1+ seek) + data read.  So, 3 -> 2 seeks best case?
> I'm not really sure what XFS is doing under the covers here..

Wow, this is a really cool implementation, beyond what I originally had
in mind from the blueprint. The handle, overlay_map and data_map look
very flexible and make small I/O cheaper in theory. Right now we only
have one element in data_map, and I'm not sure what your goal is for
its future use. I have a rough idea that it could expand NewStore's
role and reduce the local filesystem to little more than a block space
allocator: if NewStore owned a kind of FTL (File Translation Layer),
many cool features could be added. What's your idea for data_map?

My main concern is still the WAL that runs after fsync and kv
committing. The fsync path is probably fine, because we mostly won't
hit that case with rbd, but submitting a synchronous kv transaction
isn't a low-latency job, I think. Maybe we could let the WAL run in
parallel with kv committing? (Yes, I really do care about the latency
of a single op :-) )

Then, looking at an actual rados write op, it will also add setattr and
omap_setkeys ops. The current NewStore looks like it handles setattr
badly: it always re-encodes all xattrs (and other not-so-tiny fields)
and writes them out again (is this true?), although it can batch
multiple transactions' onode writes over a short time.

NewStore also puts a much richer workload on the KeyValueDB than
FileStore does, so we may need to reconsider that workload. FileStore
uses leveldb mainly for writes, which leveldb fits well, but now
overlay key reads and onode reads will become a main latency source in
the normal I/O path, I think. The default kv databases, leveldb and
rocksdb, both perform poorly for random read workloads, which may
become a problem. Looking for another kv database may be an option.

And is there still no journal code for the WAL?

Anyway, NewStore should cover more workloads than FileStore does. Good
job!

>
> sage



-- 
Best Regards,

Wheat


* Re: Initial newstore vs filestore results
  2015-04-08  2:58     ` Sage Weil
  2015-04-08  7:24       ` Haomai Wang
@ 2015-04-08 14:38       ` Mark Nelson
  2015-04-09  3:19       ` Mark Nelson
  2 siblings, 0 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-08 14:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel



On 04/07/2015 09:58 PM, Sage Weil wrote:
> On Tue, 7 Apr 2015, Mark Nelson wrote:
>> On 04/07/2015 02:16 PM, Mark Nelson wrote:
>>> On 04/07/2015 09:57 AM, Mark Nelson wrote:
>>>> Hi Guys,
>>>>
>>>> I ran some quick tests on Sage's newstore branch.  So far given that
>>>> this is a prototype, things are looking pretty good imho.  The 4MB
>>>> object rados bench read/write and small read performance looks
>>>> especially good.  Keep in mind that this is not using the SSD journals
>>>> in any way, so 640MB/s sequential writes is actually really good
>>>> compared to filestore without SSD journals.
>>>>
>>>> small write performance appears to be fairly bad, especially in the RBD
>>>> case where it's small writes to larger objects.  I'm going to sit down
>>>> and see if I can figure out what's going on.  It's bad enough that I
>>>> suspect there's just something odd going on.
>>>>
>>>> Mark
>>>
>>> Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
>>> interested:
>>>
>>> http://nhm.ceph.com/newstore/
>>>
>>> Interestingly small object write/read performance with 4 OSDs was about
>>> 1/3-1/4 the speed of the same cluster with 36 OSDs.
>>>
>>> Note: Thanks Dan for fixing the directory column width!
>>>
>>> Mark
>>
>> New fio/librbd results using Sage's latest code that attempts to keep small
>> overwrite extents in the db.  This is 4 OSD so not directly comparable to the
>> 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
>>
>> 	write	read	randw	randr
>> 4MB	57.9	319.6	55.2	285.9
>> 128KB	2.5	230.6	2.4	125.4
>> 4KB	0.46	55.65	1.11	3.56
>
> What would be very interesting would be to see the 4KB performance
> with the defaults (newstore overlay max = 32) vs overlays disabled
> (newstore overlay max = 0) and see if/how much it is helping.
>
> The latest branch also has open-by-handle.  It's on by default (newstore
> open by handle = true).  I think for most workloads it won't be very
> noticeable... I think there are two questions we need to answer though:
>
> 1) Does it have any impact on a creation workload (say, 4kb objects).  It
> shouldn't, but we should confirm.

4KB objects via rados bench ok?

>
> 2) Does it impact small object random reads with a cold cache.  I think to
> see the effect we'll probably need to pile a ton of objects into the
> store, drop caches, and then do random reads.  In the best case the
> effect will be small, but hopefully noticeable: we should go from
> a directory lookup (1+ seeks) + inode lookup (1+ seek) + data
> read, to inode lookup (1+ seek) + data read.  So, 3 -> 2 seeks best case?
> I'm not really sure what XFS is doing under the covers here...

So the above test process for RBD was basically:

1) Create a configurable-size RBD volume (16GB in this case, across 4 OSDs).
2) Fill the volume with 4MB writes to preallocate the blocks.
3) Repeat for each test:
3a) drop caches and sync
3b) run the test
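
In rough command form it looks something like this (image name, sizes, 
and fio options are approximations of the workflow, not the exact jobs 
I ran):

    #!/usr/bin/env python
    # Approximation of the RBD test sequence above, not the exact jobs used.
    import subprocess

    def run(cmd):
        print("+ " + cmd)
        subprocess.check_call(cmd, shell=True)

    # 1) Create a 16GB RBD volume.
    run("rbd create fiotest --size 16384 --pool rbd")

    # 2) Fill it with 4MB writes to preallocate the blocks.
    run("fio --name=fill --ioengine=rbd --clientname=admin --pool=rbd "
        "--rbdname=fiotest --rw=write --bs=4M --iodepth=16")

    # 3) For each test: drop caches (on the OSD hosts) and sync, then run.
    run("sync")
    run("echo 3 | sudo tee /proc/sys/vm/drop_caches")
    run("fio --name=randw4k --ioengine=rbd --clientname=admin --pool=rbd "
        "--rbdname=fiotest --rw=randwrite --bs=4k --iodepth=32 "
        "--runtime=60 --time_based")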



>
> sage


* Re: Initial newstore vs filestore results
  2015-04-08  7:24       ` Haomai Wang
@ 2015-04-08 16:49         ` Sage Weil
  2015-04-08 17:19           ` Gregory Farnum
  2015-04-08 19:16           ` Milosz Tanski
  0 siblings, 2 replies; 28+ messages in thread
From: Sage Weil @ 2015-04-08 16:49 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Mark Nelson, ceph-devel

On Wed, 8 Apr 2015, Haomai Wang wrote:
> On Wed, Apr 8, 2015 at 10:58 AM, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 7 Apr 2015, Mark Nelson wrote:
> >> On 04/07/2015 02:16 PM, Mark Nelson wrote:
> >> > On 04/07/2015 09:57 AM, Mark Nelson wrote:
> >> > > Hi Guys,
> >> > >
> >> > > I ran some quick tests on Sage's newstore branch.  So far given that
> >> > > this is a prototype, things are looking pretty good imho.  The 4MB
> >> > > object rados bench read/write and small read performance looks
> >> > > especially good.  Keep in mind that this is not using the SSD journals
> >> > > in any way, so 640MB/s sequential writes is actually really good
> >> > > compared to filestore without SSD journals.
> >> > >
> >> > > small write performance appears to be fairly bad, especially in the RBD
> >> > > case where it's small writes to larger objects.  I'm going to sit down
> >> > > and see if I can figure out what's going on.  It's bad enough that I
> >> > > suspect there's just something odd going on.
> >> > >
> >> > > Mark
> >> >
> >> > Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
> >> > interested:
> >> >
> >> > http://nhm.ceph.com/newstore/
> >> >
> >> > Interestingly small object write/read performance with 4 OSDs was about
> >> > 1/3-1/4 the speed of the same cluster with 36 OSDs.
> >> >
> >> > Note: Thanks Dan for fixing the directory column width!
> >> >
> >> > Mark
> >>
> >> New fio/librbd results using Sage's latest code that attempts to keep small
> >> overwrite extents in the db.  This is 4 OSD so not directly comparable to the
> >> 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
> >>
> >>       write   read    randw   randr
> >> 4MB   57.9    319.6   55.2    285.9
> >> 128KB 2.5     230.6   2.4     125.4
> >> 4KB   0.46    55.65   1.11    3.56
> >
> > What would be very interesting would be to see the 4KB performance
> > with the defaults (newstore overlay max = 32) vs overlays disabled
> > (newstore overlay max = 0) and see if/how much it is helping.
> >
> > The latest branch also has open-by-handle.  It's on by default (newstore
> > open by handle = true).  I think for most workloads it won't be very
> > noticeable... I think there are two questions we need to answer though:
> >
> > 1) Does it have any impact on a creation workload (say, 4kb objects).  It
> > shouldn't, but we should confirm.
> >
> > 2) Does it impact small object random reads with a cold cache.  I think to
> > see the effect we'll probably need to pile a ton of objects into the
> > store, drop caches, and then do random reads.  In the best case the
> > effect will be small, but hopefully noticeable: we should go from
> > a directory lookup (1+ seeks) + inode lookup (1+ seek) + data
> > read, to inode lookup (1+ seek) + data read.  So, 3 -> 2 seeks best case?
> > I'm not really sure what XFS is doing under the covers here..
> 
> WOW, it's really a cool implementation beyond my original mind
> according to blueprint. Handler, overlay_map and data_map looks so
> flexible and make small io cheaper in theory. Now we only have 1
> element in data_map and I'm not sure your goal about the future's
> usage. Although I have a unclearly idea that it could enhance the role
> of NewStore and make local filesystem just as a block space allocator.
> Let NewStore own a variable of FTL(File Translation Layer), so many
> cool features could be added. What's your idea about data_map?

Exactly, that is one option.  The other is that we'd treat the data_map 
similar to overlay_map with a fixed or max extent size so that a large 
partial overwrite will mostly go to a new file instead of doing the 
slow WAL.
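
Loosely, think of the data_map like this (the names and the 1MB cap 
are illustrative only, not the actual encoding):

    from collections import namedtuple

    # data_map as an extent map: logical object offset -> extent in some
    # backing file, capped at a max extent size so a big partial overwrite
    # can be redirected to a fresh file instead of going through the WAL.
    Extent = namedtuple("Extent", ["fid", "file_off", "length"])

    MAX_EXTENT = 1 << 20            # e.g. 1MB per extent (placeholder)

    data_map = {
        0:       Extent("fid_000123", 0, MAX_EXTENT),
        1 << 20: Extent("fid_000124", 0, MAX_EXTENT),
    }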

> My concern currently still is WAL after fsync and kv commiting, maybe
> fsync process is just fine because mostly we won't meet this case in
> rbd. But submit sync kv transaction isn't a low latency job I think,
> maybe we could let WAL parallel with kv commiting?(yes, I really
> concern the latency of one op :-) )

The WAL has to come after kv commit.  But the fsync after the wal 
completion sucks, especially since we are always dispatching a single 
fsync at a time so it's kind of worst-case seek behavior.  We could throw 
these into another parallel fsync queue so that the fs can batch them up, 
but I'm not sure we will have enough parallelism.  What would really be 
nice is a batch fsync syscall, but in lieu of that maybe we wait until we have a 
bunch of fsyncs pending and then throw them at the kernel together in a 
bunch of threads?  Not sure.  These aren't normally time sensitive 
unless a read comes along (which is pretty rare), but they have to be done 
for correctness.
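
To make "throw them at the kernel together" concrete, a toy sketch 
(batch size and thread count are made up):

    import os
    from concurrent.futures import ThreadPoolExecutor

    class FsyncBatcher:
        # Collect fds whose WAL work is done and fsync them from a pool of
        # threads, so the fs gets a chance to merge the work.
        def __init__(self, batch_size=16, workers=8):
            self.pending = []                  # fds waiting for fsync
            self.batch_size = batch_size
            self.pool = ThreadPoolExecutor(max_workers=workers)

        def queue(self, fd):
            self.pending.append(fd)
            if len(self.pending) >= self.batch_size:
                self.flush()

        def flush(self):
            fds, self.pending = self.pending, []
            futures = [self.pool.submit(os.fsync, fd) for fd in fds]
            for f in futures:
                f.result()                     # surface any fsync errors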

> Then from the actual rados write op, it will add setattr and
> omap_setkeys ops. Current NewStore looks plays badly for setattr. It
> always encode all xattrs(and other not so tiny fields) and write again
> (Is this true?) though it could batch multi transaction's onode write
> in short time.

Yeah, this could be optimized so that we only unpack and repack the 
bufferlist, or do a single walk through the buffer to do the updates 
(similar to what TMAP used to do).

> NewStore also employ much more workload to KeyValueDB compared to
> FileStore, so maybe we need to consider the rich workload again
> compared before. FileStore only use leveldb just for write workload
> mainly so leveldb could fit into greatly, but currently overlay
> keys(read) and onode(read) will occur a main latency source in normal
> IO I think. The default kvdb like leveldb and rocksdb both plays not
> well for random read workload, it maybe will be problem. Looking for
> another kv db maybe a choice.

I'm defaulting to rocksdb for now.  We should try LMDB at some point...

> And it still doesn't add journal codes for wal?

I'm pretty sure the WAL stuff is all complete?

Anyway, I think most of the pieces are there, so now it's a matter of 
figuring out how well they work for different workloads, then tuning and 
optimizing...

> Anyway, NewStore should cover more workloads compared to FileStore. Good 
> job!

Thanks!
sage

> 
> >
> > sage
> 
> 
> 
> -- 
> Best Regards,
> 
> Wheat
> 
> 


* Re: Initial newstore vs filestore results
  2015-04-08 16:49         ` Sage Weil
@ 2015-04-08 17:19           ` Gregory Farnum
  2015-04-08 17:38             ` Sage Weil
  2015-04-08 19:16           ` Milosz Tanski
  1 sibling, 1 reply; 28+ messages in thread
From: Gregory Farnum @ 2015-04-08 17:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: Haomai Wang, Mark Nelson, ceph-devel

On Wed, Apr 8, 2015 at 9:49 AM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 8 Apr 2015, Haomai Wang wrote:
>> On Wed, Apr 8, 2015 at 10:58 AM, Sage Weil <sage@newdream.net> wrote:
>> > On Tue, 7 Apr 2015, Mark Nelson wrote:
>> >> On 04/07/2015 02:16 PM, Mark Nelson wrote:
>> >> > On 04/07/2015 09:57 AM, Mark Nelson wrote:
>> >> > > Hi Guys,
>> >> > >
>> >> > > I ran some quick tests on Sage's newstore branch.  So far given that
>> >> > > this is a prototype, things are looking pretty good imho.  The 4MB
>> >> > > object rados bench read/write and small read performance looks
>> >> > > especially good.  Keep in mind that this is not using the SSD journals
>> >> > > in any way, so 640MB/s sequential writes is actually really good
>> >> > > compared to filestore without SSD journals.
>> >> > >
>> >> > > small write performance appears to be fairly bad, especially in the RBD
>> >> > > case where it's small writes to larger objects.  I'm going to sit down
>> >> > > and see if I can figure out what's going on.  It's bad enough that I
>> >> > > suspect there's just something odd going on.
>> >> > >
>> >> > > Mark
>> >> >
>> >> > Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
>> >> > interested:
>> >> >
>> >> > http://nhm.ceph.com/newstore/
>> >> >
>> >> > Interestingly small object write/read performance with 4 OSDs was about
>> >> > 1/3-1/4 the speed of the same cluster with 36 OSDs.
>> >> >
>> >> > Note: Thanks Dan for fixing the directory column width!
>> >> >
>> >> > Mark
>> >>
>> >> New fio/librbd results using Sage's latest code that attempts to keep small
>> >> overwrite extents in the db.  This is 4 OSD so not directly comparable to the
>> >> 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
>> >>
>> >>       write   read    randw   randr
>> >> 4MB   57.9    319.6   55.2    285.9
>> >> 128KB 2.5     230.6   2.4     125.4
>> >> 4KB   0.46    55.65   1.11    3.56
>> >
>> > What would be very interesting would be to see the 4KB performance
>> > with the defaults (newstore overlay max = 32) vs overlays disabled
>> > (newstore overlay max = 0) and see if/how much it is helping.
>> >
>> > The latest branch also has open-by-handle.  It's on by default (newstore
>> > open by handle = true).  I think for most workloads it won't be very
>> > noticeable... I think there are two questions we need to answer though:
>> >
>> > 1) Does it have any impact on a creation workload (say, 4kb objects).  It
>> > shouldn't, but we should confirm.
>> >
>> > 2) Does it impact small object random reads with a cold cache.  I think to
>> > see the effect we'll probably need to pile a ton of objects into the
>> > store, drop caches, and then do random reads.  In the best case the
>> > effect will be small, but hopefully noticeable: we should go from
>> > a directory lookup (1+ seeks) + inode lookup (1+ seek) + data
>> > read, to inode lookup (1+ seek) + data read.  So, 3 -> 2 seeks best case?
>> > I'm not really sure what XFS is doing under the covers here..
>>
>> WOW, it's really a cool implementation beyond my original mind
>> according to blueprint. Handler, overlay_map and data_map looks so
>> flexible and make small io cheaper in theory. Now we only have 1
>> element in data_map and I'm not sure your goal about the future's
>> usage. Although I have a unclearly idea that it could enhance the role
>> of NewStore and make local filesystem just as a block space allocator.
>> Let NewStore own a variable of FTL(File Translation Layer), so many
>> cool features could be added. What's your idea about data_map?
>
> Exactly, that is one option.  The other is that we'd treat the data_map
> similar to overlay_map with a fixed or max extent size so that a large
> partial overwrite will mostly go to a new file instead of doing the
> slow WAL.
>
>> My concern currently still is WAL after fsync and kv commiting, maybe
>> fsync process is just fine because mostly we won't meet this case in
>> rbd. But submit sync kv transaction isn't a low latency job I think,
>> maybe we could let WAL parallel with kv commiting?(yes, I really
>> concern the latency of one op :-) )
>
> The WAL has to come after kv commit.  But the fsync after the wal
> completion sucks, especially since we are always dispatching a single
> fsync at a time so it's kind of worst-case seek behavior.  We could throw
> these into another parallel fsync queue so that the fs can batch them up,
> but I'm not sure we will enough parallelism.  What would really be nice is
> a batch fsync syscall, but in leiu of that maybe we wait until we have a
> bunch of fsyncs pending and then throw them at the kernel together in a
> bunch of threads?  Not sure.  These aren't normally time sensitive
> unless a read comes along (which is pretty rare), but they have to be done
> for correctness.

Couldn't we write both the log entry and the data in parallel and only
acknowledge to the client once both commit?
If we replay the log without the data we'll know it didn't get
committed, and we can collect the data after replay if it's not
referenced by the log (I'm speculating, as I haven't looked at the
code or how it's actually choosing names).
-Greg

>
>> Then from the actual rados write op, it will add setattr and
>> omap_setkeys ops. Current NewStore looks plays badly for setattr. It
>> always encode all xattrs(and other not so tiny fields) and write again
>> (Is this true?) though it could batch multi transaction's onode write
>> in short time.
>
> Yeah, this could be optimized so that we only unpack and repack the
> bufferlist, or do a single walk through the buffer to do the updates
> (similar to what TMAP used to do).
>
>> NewStore also employ much more workload to KeyValueDB compared to
>> FileStore, so maybe we need to consider the rich workload again
>> compared before. FileStore only use leveldb just for write workload
>> mainly so leveldb could fit into greatly, but currently overlay
>> keys(read) and onode(read) will occur a main latency source in normal
>> IO I think. The default kvdb like leveldb and rocksdb both plays not
>> well for random read workload, it maybe will be problem. Looking for
>> another kv db maybe a choice.
>
> I'm defaulting to rocksdb for now.  We should try LMDB at some point...
>
>> And it still doesn't add journal codes for wal?
>
> I'm pretty sure the WAL stuff is all complete?
>
> Anyway, I think most of the pieces are there, so now it's a matter of
> figuring out how well they work for different workloads, then tuning and
> optimizing...
>
>> Anyway, NewStore should cover more workloads compared to FileStore. Good
>> job!
>
> Thanks!
> sage
>
>>
>> >
>> > sage
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>>
>>


* Re: Initial newstore vs filestore results
  2015-04-08 17:19           ` Gregory Farnum
@ 2015-04-08 17:38             ` Sage Weil
  0 siblings, 0 replies; 28+ messages in thread
From: Sage Weil @ 2015-04-08 17:38 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Haomai Wang, Mark Nelson, ceph-devel

On Wed, 8 Apr 2015, Gregory Farnum wrote:
> On Wed, Apr 8, 2015 at 9:49 AM, Sage Weil <sage@newdream.net> wrote:
> > On Wed, 8 Apr 2015, Haomai Wang wrote:
> >> On Wed, Apr 8, 2015 at 10:58 AM, Sage Weil <sage@newdream.net> wrote:
> >> > On Tue, 7 Apr 2015, Mark Nelson wrote:
> >> >> On 04/07/2015 02:16 PM, Mark Nelson wrote:
> >> >> > On 04/07/2015 09:57 AM, Mark Nelson wrote:
> >> >> > > Hi Guys,
> >> >> > >
> >> >> > > I ran some quick tests on Sage's newstore branch.  So far given that
> >> >> > > this is a prototype, things are looking pretty good imho.  The 4MB
> >> >> > > object rados bench read/write and small read performance looks
> >> >> > > especially good.  Keep in mind that this is not using the SSD journals
> >> >> > > in any way, so 640MB/s sequential writes is actually really good
> >> >> > > compared to filestore without SSD journals.
> >> >> > >
> >> >> > > small write performance appears to be fairly bad, especially in the RBD
> >> >> > > case where it's small writes to larger objects.  I'm going to sit down
> >> >> > > and see if I can figure out what's going on.  It's bad enough that I
> >> >> > > suspect there's just something odd going on.
> >> >> > >
> >> >> > > Mark
> >> >> >
> >> >> > Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
> >> >> > interested:
> >> >> >
> >> >> > http://nhm.ceph.com/newstore/
> >> >> >
> >> >> > Interestingly small object write/read performance with 4 OSDs was about
> >> >> > 1/3-1/4 the speed of the same cluster with 36 OSDs.
> >> >> >
> >> >> > Note: Thanks Dan for fixing the directory column width!
> >> >> >
> >> >> > Mark
> >> >>
> >> >> New fio/librbd results using Sage's latest code that attempts to keep small
> >> >> overwrite extents in the db.  This is 4 OSD so not directly comparable to the
> >> >> 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
> >> >>
> >> >>       write   read    randw   randr
> >> >> 4MB   57.9    319.6   55.2    285.9
> >> >> 128KB 2.5     230.6   2.4     125.4
> >> >> 4KB   0.46    55.65   1.11    3.56
> >> >
> >> > What would be very interesting would be to see the 4KB performance
> >> > with the defaults (newstore overlay max = 32) vs overlays disabled
> >> > (newstore overlay max = 0) and see if/how much it is helping.
> >> >
> >> > The latest branch also has open-by-handle.  It's on by default (newstore
> >> > open by handle = true).  I think for most workloads it won't be very
> >> > noticeable... I think there are two questions we need to answer though:
> >> >
> >> > 1) Does it have any impact on a creation workload (say, 4kb objects).  It
> >> > shouldn't, but we should confirm.
> >> >
> >> > 2) Does it impact small object random reads with a cold cache.  I think to
> >> > see the effect we'll probably need to pile a ton of objects into the
> >> > store, drop caches, and then do random reads.  In the best case the
> >> > effect will be small, but hopefully noticeable: we should go from
> >> > a directory lookup (1+ seeks) + inode lookup (1+ seek) + data
> >> > read, to inode lookup (1+ seek) + data read.  So, 3 -> 2 seeks best case?
> >> > I'm not really sure what XFS is doing under the covers here..
> >>
> >> WOW, it's really a cool implementation beyond my original mind
> >> according to blueprint. Handler, overlay_map and data_map looks so
> >> flexible and make small io cheaper in theory. Now we only have 1
> >> element in data_map and I'm not sure your goal about the future's
> >> usage. Although I have a unclearly idea that it could enhance the role
> >> of NewStore and make local filesystem just as a block space allocator.
> >> Let NewStore own a variable of FTL(File Translation Layer), so many
> >> cool features could be added. What's your idea about data_map?
> >
> > Exactly, that is one option.  The other is that we'd treat the data_map
> > similar to overlay_map with a fixed or max extent size so that a large
> > partial overwrite will mostly go to a new file instead of doing the
> > slow WAL.
> >
> >> My concern currently still is WAL after fsync and kv commiting, maybe
> >> fsync process is just fine because mostly we won't meet this case in
> >> rbd. But submit sync kv transaction isn't a low latency job I think,
> >> maybe we could let WAL parallel with kv commiting?(yes, I really
> >> concern the latency of one op :-) )
> >
> > The WAL has to come after kv commit.  But the fsync after the wal
> > completion sucks, especially since we are always dispatching a single
> > fsync at a time so it's kind of worst-case seek behavior.  We could throw
> > these into another parallel fsync queue so that the fs can batch them up,
> > but I'm not sure we will enough parallelism.  What would really be nice is
> > a batch fsync syscall, but in leiu of that maybe we wait until we have a
> > bunch of fsyncs pending and then throw them at the kernel together in a
> > bunch of threads?  Not sure.  These aren't normally time sensitive
> > unless a read comes along (which is pretty rare), but they have to be done
> > for correctness.
> 
> Couldn't we write both the log entry and the data in parallel and only
> acknowledge to the client once both commit?
> If we replay the log without the data we'll know it didn't get
> committed, and we can collect the data after replay if it's not
> referenced by the log (I'm speculating, as I haven't looked at the
> code or how it's actually choosing names).

We're only doing WAL for things where we need atomicity, like a partial 
overwrite.  In that case, we need to avoid clobbering old state until we 
have committed the new transaction in its entirety.  I'm not sure we can get around 
those...

There are times when we can avoid it, though, mainly with creation of new 
objects and with appends.  In both of those cases we write directly to the 
file, fsync, and then commit the kv transaction.  I guess we could do 
something clever there and roll-back the transaction if we find that the 
backing file doesn't have what we expected it to.  I'm worried that will 
turn out to be really complex, though: we'd need to roll back the kv 
transaction, including any other changes it made (creating, 
deleting, or updating other k/v pairs), *and* any later transactions on 
the same sequencer.
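
To spell the two paths out, a conceptual sketch (the kv store here is 
a stand-in dict; none of these names are the actual code):

    import os

    kv = {}   # stand-in key/value store; pretend assignments are atomic commits

    def write_new_or_append(path, data):
        # New object / append: write and fsync the data first, then commit
        # the kv metadata.  If we crash before the kv commit, the extra
        # bytes in the file are simply never referenced.
        with open(path, "ab") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        kv["onode:" + path] = "updated onode"

    def partial_overwrite(path, offset, data):
        # Partial overwrite: stage the bytes as a WAL record inside the kv
        # commit, so the old file contents are never clobbered before the
        # whole transaction is durable; then apply and clean up.
        kv["wal:" + path] = (offset, data)        # 1) kv commit (incl. WAL)
        with open(path, "r+b") as f:              # 2) apply the WAL record
            f.seek(offset)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        del kv["wal:" + path]                     # 3) delete the WAL record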

s


> -Greg
> 
> >
> >> Then from the actual rados write op, it will add setattr and
> >> omap_setkeys ops. Current NewStore looks plays badly for setattr. It
> >> always encode all xattrs(and other not so tiny fields) and write again
> >> (Is this true?) though it could batch multi transaction's onode write
> >> in short time.
> >
> > Yeah, this could be optimized so that we only unpack and repack the
> > bufferlist, or do a single walk through the buffer to do the updates
> > (similar to what TMAP used to do).
> >
> >> NewStore also employ much more workload to KeyValueDB compared to
> >> FileStore, so maybe we need to consider the rich workload again
> >> compared before. FileStore only use leveldb just for write workload
> >> mainly so leveldb could fit into greatly, but currently overlay
> >> keys(read) and onode(read) will occur a main latency source in normal
> >> IO I think. The default kvdb like leveldb and rocksdb both plays not
> >> well for random read workload, it maybe will be problem. Looking for
> >> another kv db maybe a choice.
> >
> > I'm defaulting to rocksdb for now.  We should try LMDB at some point...
> >
> >> And it still doesn't add journal codes for wal?
> >
> > I'm pretty sure the WAL stuff is all complete?
> >
> > Anyway, I think most of the pieces are there, so now it's a matter of
> > figuring out how well they work for different workloads, then tuning and
> > optimizing...
> >
> >> Anyway, NewStore should cover more workloads compared to FileStore. Good
> >> job!
> >
> > Thanks!
> > sage
> >
> >>
> >> >
> >> > sage
> >>
> >>
> >>
> >> --
> >> Best Regards,
> >>
> >> Wheat
> >>
> >>
> 
> 


* Re: Initial newstore vs filestore results
  2015-04-08 16:49         ` Sage Weil
  2015-04-08 17:19           ` Gregory Farnum
@ 2015-04-08 19:16           ` Milosz Tanski
  1 sibling, 0 replies; 28+ messages in thread
From: Milosz Tanski @ 2015-04-08 19:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: Haomai Wang, Mark Nelson, ceph-devel

On Wed, Apr 8, 2015 at 12:49 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 8 Apr 2015, Haomai Wang wrote:
>> On Wed, Apr 8, 2015 at 10:58 AM, Sage Weil <sage@newdream.net> wrote:
>> > On Tue, 7 Apr 2015, Mark Nelson wrote:
>> >> On 04/07/2015 02:16 PM, Mark Nelson wrote:
>> >> > On 04/07/2015 09:57 AM, Mark Nelson wrote:
>> >> > > Hi Guys,
>> >> > >
>> >> > > I ran some quick tests on Sage's newstore branch.  So far given that
>> >> > > this is a prototype, things are looking pretty good imho.  The 4MB
>> >> > > object rados bench read/write and small read performance looks
>> >> > > especially good.  Keep in mind that this is not using the SSD journals
>> >> > > in any way, so 640MB/s sequential writes is actually really good
>> >> > > compared to filestore without SSD journals.
>> >> > >
>> >> > > small write performance appears to be fairly bad, especially in the RBD
>> >> > > case where it's small writes to larger objects.  I'm going to sit down
>> >> > > and see if I can figure out what's going on.  It's bad enough that I
>> >> > > suspect there's just something odd going on.
>> >> > >
>> >> > > Mark
>> >> >
>> >> > Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
>> >> > interested:
>> >> >
>> >> > http://nhm.ceph.com/newstore/
>> >> >
>> >> > Interestingly small object write/read performance with 4 OSDs was about
>> >> > 1/3-1/4 the speed of the same cluster with 36 OSDs.
>> >> >
>> >> > Note: Thanks Dan for fixing the directory column width!
>> >> >
>> >> > Mark
>> >>
>> >> New fio/librbd results using Sage's latest code that attempts to keep small
>> >> overwrite extents in the db.  This is 4 OSD so not directly comparable to the
>> >> 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
>> >>
>> >>       write   read    randw   randr
>> >> 4MB   57.9    319.6   55.2    285.9
>> >> 128KB 2.5     230.6   2.4     125.4
>> >> 4KB   0.46    55.65   1.11    3.56
>> >
>> > What would be very interesting would be to see the 4KB performance
>> > with the defaults (newstore overlay max = 32) vs overlays disabled
>> > (newstore overlay max = 0) and see if/how much it is helping.
>> >
>> > The latest branch also has open-by-handle.  It's on by default (newstore
>> > open by handle = true).  I think for most workloads it won't be very
>> > noticeable... I think there are two questions we need to answer though:
>> >
>> > 1) Does it have any impact on a creation workload (say, 4kb objects).  It
>> > shouldn't, but we should confirm.
>> >
>> > 2) Does it impact small object random reads with a cold cache.  I think to
>> > see the effect we'll probably need to pile a ton of objects into the
>> > store, drop caches, and then do random reads.  In the best case the
>> > effect will be small, but hopefully noticeable: we should go from
>> > a directory lookup (1+ seeks) + inode lookup (1+ seek) + data
>> > read, to inode lookup (1+ seek) + data read.  So, 3 -> 2 seeks best case?
>> > I'm not really sure what XFS is doing under the covers here..
>>
>> WOW, it's really a cool implementation beyond my original mind
>> according to blueprint. Handler, overlay_map and data_map looks so
>> flexible and make small io cheaper in theory. Now we only have 1
>> element in data_map and I'm not sure your goal about the future's
>> usage. Although I have a unclearly idea that it could enhance the role
>> of NewStore and make local filesystem just as a block space allocator.
>> Let NewStore own a variable of FTL(File Translation Layer), so many
>> cool features could be added. What's your idea about data_map?
>
> Exactly, that is one option.  The other is that we'd treat the data_map
> similar to overlay_map with a fixed or max extent size so that a large
> partial overwrite will mostly go to a new file instead of doing the
> slow WAL.
>
>> My concern currently still is WAL after fsync and kv commiting, maybe
>> fsync process is just fine because mostly we won't meet this case in
>> rbd. But submit sync kv transaction isn't a low latency job I think,
>> maybe we could let WAL parallel with kv commiting?(yes, I really
>> concern the latency of one op :-) )
>
> The WAL has to come after kv commit.  But the fsync after the wal
> completion sucks, especially since we are always dispatching a single
> fsync at a time so it's kind of worst-case seek behavior.  We could throw
> these into another parallel fsync queue so that the fs can batch them up,
> but I'm not sure we will enough parallelism.  What would really be nice is
> a batch fsync syscall, but in leiu of that maybe we wait until we have a
> bunch of fsyncs pending and then throw them at the kernel together in a
> bunch of threads?  Not sure.  These aren't normally time sensitive
> unless a read comes along (which is pretty rare), but they have to be done
> for correctness.
>
>> Then from the actual rados write op, it will add setattr and
>> omap_setkeys ops. Current NewStore looks plays badly for setattr. It
>> always encode all xattrs(and other not so tiny fields) and write again
>> (Is this true?) though it could batch multi transaction's onode write
>> in short time.
>
> Yeah, this could be optimized so that we only unpack and repack the
> bufferlist, or do a single walk through the buffer to do the updates
> (similar to what TMAP used to do).
>
>> NewStore also puts a much heavier workload on the KeyValueDB than
>> FileStore does, so we may need to reconsider that richer workload.
>> FileStore uses leveldb mainly just for writes, which leveldb handles
>> well, but now overlay key reads and onode reads will be a major latency
>> source in the normal IO path, I think. Default kv dbs like leveldb and
>> rocksdb both perform poorly for random read workloads, which may become
>> a problem. Looking for another kv db may be an option.
>
> I'm defaulting to rocksdb for now.  We should try LMDB at some point...
>

This might be a bit tangential to the ongoing effort, but I think the
idea addresses a couple of these problems together.

You could make a store that uses LMDB directly on the partition
(block device)... and in my mind that's interesting because:
- You get a durable data store without the write amplification of a WAL or
LSM-tree, since LMDB uses a COW B-tree.
- You can batch "fsyncs". This would require some logic to merge
multiple unrelated Ceph OSD ops into a single LMDB transaction (see the
sketch below), but I think it's doable.
- Theoretically you avoid a bunch of the overhead of layering a B-tree
(database) on a B-tree (filesystem).
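
A rough sketch of what the batched-transaction part could look like with
the plain LMDB C API (error handling trimmed, path and map size are just
examples, and batch_commit is a made-up helper, not anything that exists
today):

#include <lmdb.h>
#include <string>
#include <utility>
#include <vector>

// Group several small key/value updates (e.g. from unrelated OSD ops)
// into one LMDB write transaction; a single commit makes them durable
// via LMDB's COW B-tree, with no separate journal.
void batch_commit(MDB_env* env, MDB_dbi dbi,
                  const std::vector<std::pair<std::string, std::string>>& ops) {
  MDB_txn* txn = nullptr;
  mdb_txn_begin(env, nullptr, 0, &txn);
  for (const auto& op : ops) {
    MDB_val key{op.first.size(), const_cast<char*>(op.first.data())};
    MDB_val val{op.second.size(), const_cast<char*>(op.second.data())};
    mdb_put(txn, dbi, &key, &val, 0);
  }
  mdb_txn_commit(txn);   // one durable commit for the whole batch
}

int main() {
  MDB_env* env = nullptr;
  mdb_env_create(&env);
  mdb_env_set_mapsize(env, 1ULL << 30);       // 1 GB map, just an example
  // MDB_NOSUBDIR treats the path as a single file rather than a directory.
  mdb_env_open(env, "/tmp/lmdb-store", MDB_NOSUBDIR, 0644);

  MDB_txn* txn = nullptr;
  MDB_dbi dbi;
  mdb_txn_begin(env, nullptr, 0, &txn);
  mdb_dbi_open(txn, nullptr, 0, &dbi);
  mdb_txn_commit(txn);

  batch_commit(env, dbi, {{"onode/obj1", "serialized onode"},
                          {"overlay/obj1/0", "small write payload"}});
  mdb_env_close(env);
  return 0;
}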


-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-08  2:58     ` Sage Weil
  2015-04-08  7:24       ` Haomai Wang
  2015-04-08 14:38       ` Mark Nelson
@ 2015-04-09  3:19       ` Mark Nelson
  2015-04-09 17:00         ` Mark Nelson
  2 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-09  3:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 04/07/2015 09:58 PM, Sage Weil wrote:
> On Tue, 7 Apr 2015, Mark Nelson wrote:
>> On 04/07/2015 02:16 PM, Mark Nelson wrote:
>>> On 04/07/2015 09:57 AM, Mark Nelson wrote:
>>>> Hi Guys,
>>>>
>>>> I ran some quick tests on Sage's newstore branch.  So far given that
>>>> this is a prototype, things are looking pretty good imho.  The 4MB
>>>> object rados bench read/write and small read performance looks
>>>> especially good.  Keep in mind that this is not using the SSD journals
>>>> in any way, so 640MB/s sequential writes is actually really good
>>>> compared to filestore without SSD journals.
>>>>
>>>> small write performance appears to be fairly bad, especially in the RBD
>>>> case where it's small writes to larger objects.  I'm going to sit down
>>>> and see if I can figure out what's going on.  It's bad enough that I
>>>> suspect there's just something odd going on.
>>>>
>>>> Mark
>>>
>>> Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
>>> interested:
>>>
>>> http://nhm.ceph.com/newstore/
>>>
>>> Interestingly small object write/read performance with 4 OSDs was about
>>> 1/3-1/4 the speed of the same cluster with 36 OSDs.
>>>
>>> Note: Thanks Dan for fixing the directory column width!
>>>
>>> Mark
>>
>> New fio/librbd results using Sage's latest code that attempts to keep small
>> overwrite extents in the db.  This is 4 OSD so not directly comparable to the
>> 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
>>
>> 	write	read	randw	randr
>> 4MB	57.9	319.6	55.2	285.9
>> 128KB	2.5	230.6	2.4	125.4
>> 4KB	0.46	55.65	1.11	3.56
>
> What would be very interesting would be to see the 4KB performance
> with the defaults (newstore overlay max = 32) vs overlays disabled
> (newstore overlay max = 0) and see if/how much it is helping.

And here we go.  1 OSD, 1X replication.  16GB RBD volume.

4MB		write	read	randw	randr
default overlay	36.13	106.61	34.49	92.69
no overlay	36.29	105.61	34.49	93.55
				
128KB		write	read	randw	randr
default overlay	1.71	97.90	1.65	25.79
no overlay	1.72	97.80	1.66	25.78
				
4KB		write	read	randw	randr
default overlay	0.40	61.88	1.29	1.11
no overlay	0.05	61.26	0.05	1.10

Seekwatcher movies are generating now, but I'm going to bed soon, so I'll 
have to wait until tomorrow morning to post them. :)

>
> The latest branch also has open-by-handle.  It's on by default (newstore
> open by handle = true).  I think for most workloads it won't be very
> noticeable... I think there are two questions we need to answer though:
>
> 1) Does it have any impact on a creation workload (say, 4kb objects).  It
> shouldn't, but we should confirm.
>
> 2) Does it impact small object random reads with a cold cache.  I think to
> see the effect we'll probably need to pile a ton of objects into the
> store, drop caches, and then do random reads.  In the best case the
> effect will be small, but hopefully noticeable: we should go from
> a directory lookup (1+ seeks) + inode lookup (1+ seek) + data
> read, to inode lookup (1+ seek) + data read.  So, 3 -> 2 seeks best case?
> I'm not really sure what XFS is doing under the covers here...
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-09  3:19       ` Mark Nelson
@ 2015-04-09 17:00         ` Mark Nelson
  2015-04-10  6:11           ` Duan, Jiangang
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-09 17:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 04/08/2015 10:19 PM, Mark Nelson wrote:
> On 04/07/2015 09:58 PM, Sage Weil wrote:
>> What would be very interesting would be to see the 4KB performance
>> with the defaults (newstore overlay max = 32) vs overlays disabled
>> (newstore overlay max = 0) and see if/how much it is helping.
>
> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>
> 4MB        write    read    randw    randr
> default overlay    36.13    106.61    34.49    92.69
> no overlay    36.29    105.61    34.49    93.55
>
> 128KB        write    read    randw    randr
> default overlay    1.71    97.90    1.65    25.79
> no overlay    1.72    97.80    1.66    25.78
>
> 4KB        write    read    randw    randr
> default overlay    0.40    61.88    1.29    1.11
> no overlay    0.05    61.26    0.05    1.10
>

Update this morning.  Also ran filestore tests for comparison.  Next 
we'll look at how tweaking the overlay for different IO sizes affects 
things; i.e. the overlay threshold is 64k right now, and it appears that 
128K write IOs, for instance, are currently quite a bit worse with 
newstore than with filestore (a rough sketch of the overlay decision 
follows the tables below).  Sage also just committed changes that will 
allow overlay writes during append/create, which may help improve small 
IO write performance in some cases as well.

4MB		write	read	randw	randr
default overlay	36.13	106.61	34.49	92.69
no overlay	36.29	105.61	34.49	93.55
filestore	36.17	84.59	34.11	79.85
				
128KB		write	read	randw	randr
default overlay	1.71	97.90	1.65	25.79
no overlay	1.72	97.80	1.66	25.78
filestore	27.15	79.91	8.77	19.00
				
4KB		write	read	randw	randr
default overlay	0.40	61.88	1.29	1.11
no overlay	0.05	61.26	0.05	1.10
filestore	4.14	56.30	0.42	0.76
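
(For reference, the overlay decision being tweaked here boils down to
something like the sketch below.  The 64k/32 values come from this thread;
the option and struct names are otherwise assumptions, not the actual
NewStore source.)

#include <cstdint>

// Small (over)writes go into the kv store as "overlay" keys and get
// flushed to the backing file later; anything larger goes to the file
// (plus WAL for overwrites).
struct OverlayConfig {
  uint64_t overlay_max_length = 64 * 1024;   // the 64k threshold above
  uint32_t overlay_max = 32;                 // "newstore overlay max"
};

bool use_overlay(const OverlayConfig& conf, uint64_t write_len,
                 uint32_t overlays_already_on_object) {
  return write_len <= conf.overlay_max_length &&
         overlays_already_on_object < conf.overlay_max;
}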

Seekwatcher movies and graphs available here:

http://nhm.ceph.com/newstore/20150408/

Note for instance the very interesting blktrace patterns for 4K random 
writes on the OSD in each case:

http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png

Mark

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Initial newstore vs filestore results
  2015-04-09 17:00         ` Mark Nelson
@ 2015-04-10  6:11           ` Duan, Jiangang
  2015-04-10 10:25             ` Ning Yao
  2015-04-10 12:07             ` Mark Nelson
  0 siblings, 2 replies; 28+ messages in thread
From: Duan, Jiangang @ 2015-04-10  6:11 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil; +Cc: ceph-devel

IMHO, newstore performance depends heavily on KV store performance due to the WAL - so picking the right KV store, or tuning it, will be the first step to take.

-jiangang


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Friday, April 10, 2015 1:01 AM
To: Sage Weil
Cc: ceph-devel
Subject: Re: Initial newstore vs filestore results

On 04/08/2015 10:19 PM, Mark Nelson wrote:
> On 04/07/2015 09:58 PM, Sage Weil wrote:
>> What would be very interesting would be to see the 4KB performance 
>> with the defaults (newstore overlay max = 32) vs overlays disabled 
>> (newstore overlay max = 0) and see if/how much it is helping.
>
> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>
> 4MB        write    read    randw    randr
> default overlay    36.13    106.61    34.49    92.69
> no overlay    36.29    105.61    34.49    93.55
>
> 128KB        write    read    randw    randr
> default overlay    1.71    97.90    1.65    25.79
> no overlay    1.72    97.80    1.66    25.78
>
> 4KB        write    read    randw    randr
> default overlay    0.40    61.88    1.29    1.11
> no overlay    0.05    61.26    0.05    1.10
>

Update this morning.  Also ran filestore tests for comparison.  Next we'll look at how tweaking the overlay for different IO sizes affects things.  IE the overlay threshold is 64k right now and it appears that 128K write IOs for instance are quite a bit worse with newstore currently than with filestore.  Sage also just committed changes that will allow overlay writes during append/create which may help improve small IO write performance as well in some cases.

4MB		write	read	randw	randr
default overlay	36.13	106.61	34.49	92.69
no overlay	36.29	105.61	34.49	93.55
filestore	36.17	84.59	34.11	79.85
				
128KB		write	read	randw	randr
default overlay	1.71	97.90	1.65	25.79
no overlay	1.72	97.80	1.66	25.78
filestore	27.15	79.91	8.77	19.00
				
4KB		write	read	randw	randr
default overlay	0.40	61.88	1.29	1.11
no overlay	0.05	61.26	0.05	1.10
filestore	4.14	56.30	0.42	0.76

Seekwatcher movies and graphs available here:

http://nhm.ceph.com/newstore/20150408/

Note for instance the very interesting blktrace patterns for 4K random writes on the OSD in each case:

http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-10  6:11           ` Duan, Jiangang
@ 2015-04-10 10:25             ` Ning Yao
  2015-04-10 15:28               ` Sage Weil
  2015-04-10 12:07             ` Mark Nelson
  1 sibling, 1 reply; 28+ messages in thread
From: Ning Yao @ 2015-04-10 10:25 UTC (permalink / raw)
  To: Duan, Jiangang; +Cc: Mark Nelson, Sage Weil, ceph-devel

The KV store introduces too much write amplification; maybe we need a
self-implemented WAL?
Regards
Ning Yao


2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
> IMHO, the newstore performance depends so much on KV store performance due to the WAL -  so pick up the right KV or tune it will be the 1st step to do.
>
> -jiangang
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Friday, April 10, 2015 1:01 AM
> To: Sage Weil
> Cc: ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>> What would be very interesting would be to see the 4KB performance
>>> with the defaults (newstore overlay max = 32) vs overlays disabled
>>> (newstore overlay max = 0) and see if/how much it is helping.
>>
>> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>>
>> 4MB        write    read    randw    randr
>> default overlay    36.13    106.61    34.49    92.69
>> no overlay    36.29    105.61    34.49    93.55
>>
>> 128KB        write    read    randw    randr
>> default overlay    1.71    97.90    1.65    25.79
>> no overlay    1.72    97.80    1.66    25.78
>>
>> 4KB        write    read    randw    randr
>> default overlay    0.40    61.88    1.29    1.11
>> no overlay    0.05    61.26    0.05    1.10
>>
>
> Update this morning.  Also ran filestore tests for comparison.  Next we'll look at how tweaking the overlay for different IO sizes affects things.  IE the overlay threshold is 64k right now and it appears that 128K write IOs for instance are quite a bit worse with newstore currently than with filestore.  Sage also just committed changes that will allow overlay writes during append/create which may help improve small IO write performance as well in some cases.
>
> 4MB             write   read    randw   randr
> default overlay 36.13   106.61  34.49   92.69
> no overlay      36.29   105.61  34.49   93.55
> filestore       36.17   84.59   34.11   79.85
>
> 128KB           write   read    randw   randr
> default overlay 1.71    97.90   1.65    25.79
> no overlay      1.72    97.80   1.66    25.78
> filestore       27.15   79.91   8.77    19.00
>
> 4KB             write   read    randw   randr
> default overlay 0.40    61.88   1.29    1.11
> no overlay      0.05    61.26   0.05    1.10
> filestore       4.14    56.30   0.42    0.76
>
> Seekwatcher movies and graphs available here:
>
> http://nhm.ceph.com/newstore/20150408/
>
> Note for instance the very interesting blktrace patterns for 4K random writes on the OSD in each case:
>
> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-10  6:11           ` Duan, Jiangang
  2015-04-10 10:25             ` Ning Yao
@ 2015-04-10 12:07             ` Mark Nelson
  1 sibling, 0 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-10 12:07 UTC (permalink / raw)
  To: Duan, Jiangang, Sage Weil; +Cc: ceph-devel

I don't disagree, but I agree with Sage that testing different overlay 
values is useful as well.  I will have results to post later this 
morning.  At some point soon I'll move on to testing how rocksdb WAL on 
SSD and/or rocksdb entirely on SSD helps.  There are definitely some 
interesting trade-offs here.
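
A minimal illustration of the rocksdb-WAL-on-SSD case: point rocksdb's WAL
directory at a separate device while the sst files stay on the data disk.
The paths are made up, and this ignores how the option would actually be
plumbed through the OSD.

#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.wal_dir = "/ssd/newstore-wal";   // .log files land on the SSD
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/data/osd0/db", &db);
  assert(s.ok());
  // ... workload ...
  delete db;
  return 0;
}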

Mark

On 04/10/2015 01:11 AM, Duan, Jiangang wrote:
> IMHO, the newstore performance depends so much on KV store performance due to the WAL -  so pick up the right KV or tune it will be the 1st step to do.
>
> -jiangang
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Friday, April 10, 2015 1:01 AM
> To: Sage Weil
> Cc: ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>> What would be very interesting would be to see the 4KB performance
>>> with the defaults (newstore overlay max = 32) vs overlays disabled
>>> (newstore overlay max = 0) and see if/how much it is helping.
>>
>> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>>
>> 4MB        write    read    randw    randr
>> default overlay    36.13    106.61    34.49    92.69
>> no overlay    36.29    105.61    34.49    93.55
>>
>> 128KB        write    read    randw    randr
>> default overlay    1.71    97.90    1.65    25.79
>> no overlay    1.72    97.80    1.66    25.78
>>
>> 4KB        write    read    randw    randr
>> default overlay    0.40    61.88    1.29    1.11
>> no overlay    0.05    61.26    0.05    1.10
>>
>
> Update this morning.  Also ran filestore tests for comparison.  Next we'll look at how tweaking the overlay for different IO sizes affects things.  IE the overlay threshold is 64k right now and it appears that 128K write IOs for instance are quite a bit worse with newstore currently than with filestore.  Sage also just committed changes that will allow overlay writes during append/create which may help improve small IO write performance as well in some cases.
>
> 4MB		write	read	randw	randr
> default overlay	36.13	106.61	34.49	92.69
> no overlay	36.29	105.61	34.49	93.55
> filestore	36.17	84.59	34.11	79.85
> 				
> 128KB		write	read	randw	randr
> default overlay	1.71	97.90	1.65	25.79
> no overlay	1.72	97.80	1.66	25.78
> filestore	27.15	79.91	8.77	19.00
> 				
> 4KB		write	read	randw	randr
> default overlay	0.40	61.88	1.29	1.11
> no overlay	0.05	61.26	0.05	1.10
> filestore	4.14	56.30	0.42	0.76
>
> Seekwatcher movies and graphs available here:
>
> http://nhm.ceph.com/newstore/20150408/
>
> Note for instance the very interesting blktrace patterns for 4K random writes on the OSD in each case:
>
> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-10 10:25             ` Ning Yao
@ 2015-04-10 15:28               ` Sage Weil
  2015-04-10 15:53                 ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2015-04-10 15:28 UTC (permalink / raw)
  To: Ning Yao; +Cc: Duan, Jiangang, Mark Nelson, ceph-devel

On Fri, 10 Apr 2015, Ning Yao wrote:
> KV store introduces too much write amplification, we may need
> self-implemented WAL?

What we really want is to hint to the kv store that these keys (or this 
key range) is short-lived and should never get compacted.  And/or, we need 
to just make sure the wal is sufficiently large so that in practice that 
never happens to those keys.

Putting them outside the kv store means an additional seek/sync for disks, 
which defeats most of the purpose.  Maybe it makes sense for flash... but 
the above avoids the problem in either case.

I think we should target rocksdb for our initial tuning attempts.  So far 
all I've done is played a bit with the file size (1mb -> 4mb -> 8mb) 
but my ad hoc tests didn't see much difference.
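
As a concrete illustration of the kind of rocksdb tuning meant here (keep
more of the log/memtable in memory so short-lived WAL keys never reach an
sst, and play with the sst file size), something like the following; the
numbers are placeholders, not recommendations:

#include <rocksdb/options.h>

rocksdb::Options tuned_options() {
  rocksdb::Options opts;
  opts.write_buffer_size = 64 << 20;          // bigger memtable
  opts.max_write_buffer_number = 4;           // allow several in flight
  opts.min_write_buffer_number_to_merge = 2;  // merge before flushing
  opts.max_total_wal_size = 256 << 20;        // let the log grow larger
  opts.target_file_size_base = 8 << 20;       // sst file size (the 1mb/4mb/8mb knob?)
  return opts;
}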

sage



> Regards
> Ning Yao
> 
> 
> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
> > IMHO, the newstore performance depends so much on KV store performance due to the WAL -  so pick up the right KV or tune it will be the 1st step to do.
> >
> > -jiangang
> >
> >
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Friday, April 10, 2015 1:01 AM
> > To: Sage Weil
> > Cc: ceph-devel
> > Subject: Re: Initial newstore vs filestore results
> >
> > On 04/08/2015 10:19 PM, Mark Nelson wrote:
> >> On 04/07/2015 09:58 PM, Sage Weil wrote:
> >>> What would be very interesting would be to see the 4KB performance
> >>> with the defaults (newstore overlay max = 32) vs overlays disabled
> >>> (newstore overlay max = 0) and see if/how much it is helping.
> >>
> >> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
> >>
> >> 4MB        write    read    randw    randr
> >> default overlay    36.13    106.61    34.49    92.69
> >> no overlay    36.29    105.61    34.49    93.55
> >>
> >> 128KB        write    read    randw    randr
> >> default overlay    1.71    97.90    1.65    25.79
> >> no overlay    1.72    97.80    1.66    25.78
> >>
> >> 4KB        write    read    randw    randr
> >> default overlay    0.40    61.88    1.29    1.11
> >> no overlay    0.05    61.26    0.05    1.10
> >>
> >
> > Update this morning.  Also ran filestore tests for comparison.  Next we'll look at how tweaking the overlay for different IO sizes affects things.  IE the overlay threshold is 64k right now and it appears that 128K write IOs for instance are quite a bit worse with newstore currently than with filestore.  Sage also just committed changes that will allow overlay writes during append/create which may help improve small IO write performance as well in some cases.
> >
> > 4MB             write   read    randw   randr
> > default overlay 36.13   106.61  34.49   92.69
> > no overlay      36.29   105.61  34.49   93.55
> > filestore       36.17   84.59   34.11   79.85
> >
> > 128KB           write   read    randw   randr
> > default overlay 1.71    97.90   1.65    25.79
> > no overlay      1.72    97.80   1.66    25.78
> > filestore       27.15   79.91   8.77    19.00
> >
> > 4KB             write   read    randw   randr
> > default overlay 0.40    61.88   1.29    1.11
> > no overlay      0.05    61.26   0.05    1.10
> > filestore       4.14    56.30   0.42    0.76
> >
> > Seekwatcher movies and graphs available here:
> >
> > http://nhm.ceph.com/newstore/20150408/
> >
> > Note for instance the very interesting blktrace patterns for 4K random writes on the OSD in each case:
> >
> > http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
> > http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
> > http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
> >
> > Mark
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-10 15:28               ` Sage Weil
@ 2015-04-10 15:53                 ` Mark Nelson
  2015-04-10 19:41                   ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-10 15:53 UTC (permalink / raw)
  To: Sage Weil, Ning Yao; +Cc: Duan, Jiangang, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 4569 bytes --]

Test results attached for different overlay settings at various IO sizes 
for writes and random writes.  Basically it looks like increasing the 
overlay size changes the curve.  So far we're still not doing as well as 
the filestore (co-located journal) though.

I imagine the WAL probably does play a big part here.

Mark

On 04/10/2015 10:28 AM, Sage Weil wrote:
> On Fri, 10 Apr 2015, Ning Yao wrote:
>> KV store introduces too much write amplification, we may need
>> self-implemented WAL?
>
> What we really want is to hint to the kv store that these keys (or this
> key range) is short-lived and should never get compacted.  And/or, we need
> to just make sure the wal is sufficiently large so that in practice that
> never happens to those keys.
>
> Putting them outside the kv store means an additional seek/sync for disks,
> which defeats most of the purpose.  Maybe it makes sense for flash... but
> the above avoids the problem in either case.
>
> I think we should target rocksdb for our initial tuning attempts.  So far
> all I've done is played a bit with the file size (1mb -> 4mb -> 8mb)
> but my ad hoc tests didn't see much difference.
>
> sage
>
>
>
>> Regards
>> Ning Yao
>>
>>
>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
>>> IMHO, the newstore performance depends so much on KV store performance due to the WAL -  so pick up the right KV or tune it will be the 1st step to do.
>>>
>>> -jiangang
>>>
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Friday, April 10, 2015 1:01 AM
>>> To: Sage Weil
>>> Cc: ceph-devel
>>> Subject: Re: Initial newstore vs filestore results
>>>
>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>> What would be very interesting would be to see the 4KB performance
>>>>> with the defaults (newstore overlay max = 32) vs overlays disabled
>>>>> (newstore overlay max = 0) and see if/how much it is helping.
>>>>
>>>> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>>>>
>>>> 4MB        write    read    randw    randr
>>>> default overlay    36.13    106.61    34.49    92.69
>>>> no overlay    36.29    105.61    34.49    93.55
>>>>
>>>> 128KB        write    read    randw    randr
>>>> default overlay    1.71    97.90    1.65    25.79
>>>> no overlay    1.72    97.80    1.66    25.78
>>>>
>>>> 4KB        write    read    randw    randr
>>>> default overlay    0.40    61.88    1.29    1.11
>>>> no overlay    0.05    61.26    0.05    1.10
>>>>
>>>
>>> Update this morning.  Also ran filestore tests for comparison.  Next we'll look at how tweaking the overlay for different IO sizes affects things.  IE the overlay threshold is 64k right now and it appears that 128K write IOs for instance are quite a bit worse with newstore currently than with filestore.  Sage also just committed changes that will allow overlay writes during append/create which may help improve small IO write performance as well in some cases.
>>>
>>> 4MB             write   read    randw   randr
>>> default overlay 36.13   106.61  34.49   92.69
>>> no overlay      36.29   105.61  34.49   93.55
>>> filestore       36.17   84.59   34.11   79.85
>>>
>>> 128KB           write   read    randw   randr
>>> default overlay 1.71    97.90   1.65    25.79
>>> no overlay      1.72    97.80   1.66    25.78
>>> filestore       27.15   79.91   8.77    19.00
>>>
>>> 4KB             write   read    randw   randr
>>> default overlay 0.40    61.88   1.29    1.11
>>> no overlay      0.05    61.26   0.05    1.10
>>> filestore       4.14    56.30   0.42    0.76
>>>
>>> Seekwatcher movies and graphs available here:
>>>
>>> http://nhm.ceph.com/newstore/20150408/
>>>
>>> Note for instance the very interesting blktrace patterns for 4K random writes on the OSD in each case:
>>>
>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
>>>
>>> Mark
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

[-- Attachment #2: filestore_vs_overlay.pdf --]
[-- Type: application/pdf, Size: 51599 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-10 15:53                 ` Mark Nelson
@ 2015-04-10 19:41                   ` Mark Nelson
  2015-04-10 20:04                     ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-10 19:41 UTC (permalink / raw)
  To: Sage Weil, Ning Yao; +Cc: Duan, Jiangang, ceph-devel

Seekwatcher movies and graphs finally finished generating for all of the 
tests:

http://nhm.ceph.com/newstore/20150409/

Mark

On 04/10/2015 10:53 AM, Mark Nelson wrote:
> Test results attached for different overlay settings at various IO sizes
> for writes and random writes.  Basically it looks like as we increase
> the overlay size it changes the curve.  So far we're still not doing as
> good as the filestore (co-located journal) though.
>
> I imagine the WAL probably does play a big part here.
>
> Mark
>
> On 04/10/2015 10:28 AM, Sage Weil wrote:
>> On Fri, 10 Apr 2015, Ning Yao wrote:
>>> KV store introduces too much write amplification, we may need
>>> self-implemented WAL?
>>
>> What we really want is to hint to the kv store that these keys (or this
>> key range) is short-lived and should never get compacted.  And/or, we
>> need
>> to just make sure the wal is sufficiently large so that in practice that
>> never happens to those keys.
>>
>> Putting them outside the kv store means an additional seek/sync for
>> disks,
>> which defeats most of the purpose.  Maybe it makes sense for flash... but
>> the above avoids the problem in either case.
>>
>> I think we should target rocksdb for our initial tuning attempts.  So far
>> all I've done is played a bit with the file size (1mb -> 4mb -> 8mb)
>> but my ad hoc tests didn't see much difference.
>>
>> sage
>>
>>
>>
>>> Regards
>>> Ning Yao
>>>
>>>
>>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
>>>> IMHO, the newstore performance depends so much on KV store
>>>> performance due to the WAL -  so pick up the right KV or tune it
>>>> will be the 1st step to do.
>>>>
>>>> -jiangang
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>> Sent: Friday, April 10, 2015 1:01 AM
>>>> To: Sage Weil
>>>> Cc: ceph-devel
>>>> Subject: Re: Initial newstore vs filestore results
>>>>
>>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>>> What would be very interesting would be to see the 4KB performance
>>>>>> with the defaults (newstore overlay max = 32) vs overlays disabled
>>>>>> (newstore overlay max = 0) and see if/how much it is helping.
>>>>>
>>>>> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>>>>>
>>>>> 4MB        write    read    randw    randr
>>>>> default overlay    36.13    106.61    34.49    92.69
>>>>> no overlay    36.29    105.61    34.49    93.55
>>>>>
>>>>> 128KB        write    read    randw    randr
>>>>> default overlay    1.71    97.90    1.65    25.79
>>>>> no overlay    1.72    97.80    1.66    25.78
>>>>>
>>>>> 4KB        write    read    randw    randr
>>>>> default overlay    0.40    61.88    1.29    1.11
>>>>> no overlay    0.05    61.26    0.05    1.10
>>>>>
>>>>
>>>> Update this morning.  Also ran filestore tests for comparison.  Next
>>>> we'll look at how tweaking the overlay for different IO sizes
>>>> affects things.  IE the overlay threshold is 64k right now and it
>>>> appears that 128K write IOs for instance are quite a bit worse with
>>>> newstore currently than with filestore.  Sage also just committed
>>>> changes that will allow overlay writes during append/create which
>>>> may help improve small IO write performance as well in some cases.
>>>>
>>>> 4MB             write   read    randw   randr
>>>> default overlay 36.13   106.61  34.49   92.69
>>>> no overlay      36.29   105.61  34.49   93.55
>>>> filestore       36.17   84.59   34.11   79.85
>>>>
>>>> 128KB           write   read    randw   randr
>>>> default overlay 1.71    97.90   1.65    25.79
>>>> no overlay      1.72    97.80   1.66    25.78
>>>> filestore       27.15   79.91   8.77    19.00
>>>>
>>>> 4KB             write   read    randw   randr
>>>> default overlay 0.40    61.88   1.29    1.11
>>>> no overlay      0.05    61.26   0.05    1.10
>>>> filestore       4.14    56.30   0.42    0.76
>>>>
>>>> Seekwatcher movies and graphs available here:
>>>>
>>>> http://nhm.ceph.com/newstore/20150408/
>>>>
>>>> Note for instance the very interesting blktrace patterns for 4K
>>>> random writes on the OSD in each case:
>>>>
>>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
>>>>
>>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
>>>>
>>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
>>>>
>>>>
>>>> Mark
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-10 19:41                   ` Mark Nelson
@ 2015-04-10 20:04                     ` Mark Nelson
  2015-04-10 23:24                       ` Sage Weil
  2015-04-10 23:43                       ` Duan, Jiangang
  0 siblings, 2 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-10 20:04 UTC (permalink / raw)
  To: Sage Weil, Ning Yao; +Cc: Duan, Jiangang, ceph-devel

Notice for instance a comparison of random 512k writes between 
filestore, newstore with no overlay, and newstore with 8m overlay:

http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite.png
http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite.png
http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite.png

The client rbd throughput as reported by fio is:

filestore: 20.44MB/s
newstore+no_overlay: 4.35MB/s
newstore+8m_overlay: 3.86MB/s

But notice that in the graphs, we see very different behaviors on disk.

Filestore does a lot of reads and writes to a couple of specific 
portions of the device and has peaks/valleys when data gets written out 
in bulk.  I would have expected to see more sequential looking writes 
during the peaks due to journal writes and no reads to that portion of 
the disk, but it seems murkier to me than that.

http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg

newstore+no_overlay does kind of a flurry of random IO and looks like 
it's somewhat seek bound.  It's very consistent but actual write 
performance is low compared to what blktrace reports as the data hitting 
the disk.  Something is happening toward the beginning of the drive too.

http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg

newstore+8m overlay is interesting.  Lots of data gets written out to 
the disk in seemingly large chunks but the actual throughput as reported 
by the client is very slow.  I assume there's tons of write 
amplification happening as rocksdb moves the 512k objects around into 
different levels.

http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg

Mark

On 04/10/2015 02:41 PM, Mark Nelson wrote:
> Seekwatcher movies and graphs finally finished generating for all of the
> tests:
>
> http://nhm.ceph.com/newstore/20150409/
>
> Mark
>
> On 04/10/2015 10:53 AM, Mark Nelson wrote:
>> Test results attached for different overlay settings at various IO sizes
>> for writes and random writes.  Basically it looks like as we increase
>> the overlay size it changes the curve.  So far we're still not doing as
>> good as the filestore (co-located journal) though.
>>
>> I imagine the WAL probably does play a big part here.
>>
>> Mark
>>
>> On 04/10/2015 10:28 AM, Sage Weil wrote:
>>> On Fri, 10 Apr 2015, Ning Yao wrote:
>>>> KV store introduces too much write amplification, we may need
>>>> self-implemented WAL?
>>>
>>> What we really want is to hint to the kv store that these keys (or this
>>> key range) is short-lived and should never get compacted.  And/or, we
>>> need
>>> to just make sure the wal is sufficiently large so that in practice that
>>> never happens to those keys.
>>>
>>> Putting them outside the kv store means an additional seek/sync for
>>> disks,
>>> which defeats most of the purpose.  Maybe it makes sense for flash...
>>> but
>>> the above avoids the problem in either case.
>>>
>>> I think we should target rocksdb for our initial tuning attempts.  So
>>> far
>>> all I've done is played a bit with the file size (1mb -> 4mb -> 8mb)
>>> but my ad hoc tests didn't see much difference.
>>>
>>> sage
>>>
>>>
>>>
>>>> Regards
>>>> Ning Yao
>>>>
>>>>
>>>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
>>>>> IMHO, the newstore performance depends so much on KV store
>>>>> performance due to the WAL -  so pick up the right KV or tune it
>>>>> will be the 1st step to do.
>>>>>
>>>>> -jiangang
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>> Sent: Friday, April 10, 2015 1:01 AM
>>>>> To: Sage Weil
>>>>> Cc: ceph-devel
>>>>> Subject: Re: Initial newstore vs filestore results
>>>>>
>>>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>>>> What would be very interesting would be to see the 4KB performance
>>>>>>> with the defaults (newstore overlay max = 32) vs overlays disabled
>>>>>>> (newstore overlay max = 0) and see if/how much it is helping.
>>>>>>
>>>>>> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>>>>>>
>>>>>> 4MB        write    read    randw    randr
>>>>>> default overlay    36.13    106.61    34.49    92.69
>>>>>> no overlay    36.29    105.61    34.49    93.55
>>>>>>
>>>>>> 128KB        write    read    randw    randr
>>>>>> default overlay    1.71    97.90    1.65    25.79
>>>>>> no overlay    1.72    97.80    1.66    25.78
>>>>>>
>>>>>> 4KB        write    read    randw    randr
>>>>>> default overlay    0.40    61.88    1.29    1.11
>>>>>> no overlay    0.05    61.26    0.05    1.10
>>>>>>
>>>>>
>>>>> Update this morning.  Also ran filestore tests for comparison.  Next
>>>>> we'll look at how tweaking the overlay for different IO sizes
>>>>> affects things.  IE the overlay threshold is 64k right now and it
>>>>> appears that 128K write IOs for instance are quite a bit worse with
>>>>> newstore currently than with filestore.  Sage also just committed
>>>>> changes that will allow overlay writes during append/create which
>>>>> may help improve small IO write performance as well in some cases.
>>>>>
>>>>> 4MB             write   read    randw   randr
>>>>> default overlay 36.13   106.61  34.49   92.69
>>>>> no overlay      36.29   105.61  34.49   93.55
>>>>> filestore       36.17   84.59   34.11   79.85
>>>>>
>>>>> 128KB           write   read    randw   randr
>>>>> default overlay 1.71    97.90   1.65    25.79
>>>>> no overlay      1.72    97.80   1.66    25.78
>>>>> filestore       27.15   79.91   8.77    19.00
>>>>>
>>>>> 4KB             write   read    randw   randr
>>>>> default overlay 0.40    61.88   1.29    1.11
>>>>> no overlay      0.05    61.26   0.05    1.10
>>>>> filestore       4.14    56.30   0.42    0.76
>>>>>
>>>>> Seekwatcher movies and graphs available here:
>>>>>
>>>>> http://nhm.ceph.com/newstore/20150408/
>>>>>
>>>>> Note for instance the very interesting blktrace patterns for 4K
>>>>> random writes on the OSD in each case:
>>>>>
>>>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
>>>>>
>>>>>
>>>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
>>>>>
>>>>>
>>>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
>>>>>
>>>>>
>>>>>
>>>>> Mark
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>> ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-10 20:04                     ` Mark Nelson
@ 2015-04-10 23:24                       ` Sage Weil
  2015-04-10 23:44                         ` Duan, Jiangang
  2015-04-10 23:43                       ` Duan, Jiangang
  1 sibling, 1 reply; 28+ messages in thread
From: Sage Weil @ 2015-04-10 23:24 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Ning Yao, Duan, Jiangang, ceph-devel

On Fri, 10 Apr 2015, Mark Nelson wrote:
> Notice for instance a comparison of random 512k writes between filestore,
> newstore with no overlay, and newstore with 8m overlay:
> 
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite.png
> 
> The client rbd throughput as reported by fio is:
> 
> filestore: 20.44MB/s
> newstore+no_overlay: 4.35MB/s
> newstore+8m_overlay: 3.86MB/s
> 
> But notice that in the graphs, we see very different behaviors on disk.
> 
> Filestore does a lot of reads and writes to a couple of specific portions of
> the device and has peaks/valleys when data gets written out in bulk.  I would
> have expected to see more sequential looking writes during the peaks due to
> journal writes and no reads to that portion of the disk, but it seems murkier
> to me than that.
> 
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
> 
> newstore+no_overlay does kind of a flurry of random IO and looks like it's
> somewhat seek bound.  It's very consistent but actual write performance is low
> compared to what blktrace reports as the data hitting the disk.  Something
> happening toward the beginning of the drive too.
> 
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg

Yeah, looks like a bunch of write amplification... the disk bw used is 
really high.  I think we need to look at what rocksdb is doing here.  A 
couple things:

 - Make the log bigger, if we can, so that short-lived WAL keys don't get 
amplified.  We'd rather eat memory than rewrite them in an sst since the 
number of them in flight is pretty well bounded.

 - The rocksdb log as it stands isn't ever going to perform as well as the 
FileJournal currently does.  The FileJournal uses a fixed-size device or 
file that's preallocated with no 'size' associated with it, so that when 
there is a write we only have to push down the data blocks (one seek), and 
on replay can identify valid records with a seq # and checksum.  
Rocksdb's log is a .log file that grows and gets fsync(2)'d, which means 
that the data blocks have to hit the disk *and* the inode (size) needs to 
get updated for the commit to happen.  We could improve this by doing a 
fallocate and turning it into a circular buffer.  I'm not sure XFS will 
let us fallocate a fresh file of 0's though and avoid a second seek 
because it'll still need to flip the extent bits when the data blocks are 
written... or prefill the file with 0's before using it.  :/
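
A sketch of the prefill-with-zeros variant (illustrative only; record
framing with a seq # and checksum, and retirement of old records, are
left out):

#include <algorithm>
#include <cstdint>
#include <fcntl.h>
#include <string>
#include <unistd.h>
#include <vector>

// Preallocate and zero-fill a fixed-size log file once, then treat it as
// a circular buffer.  Later appends are in-place pwrite()s, so with
// O_DSYNC only data blocks have to reach the disk, not inode updates.
class CircularLog {
  int fd = -1;
  uint64_t size = 0, pos = 0;
public:
  bool create(const std::string& path, uint64_t bytes) {
    fd = ::open(path.c_str(), O_CREAT | O_RDWR | O_DSYNC, 0644);
    if (fd < 0) return false;
    size = bytes;
    std::vector<char> zeros(1 << 20, 0);
    for (uint64_t off = 0; off < size; off += zeros.size()) {
      size_t n = std::min<uint64_t>(zeros.size(), size - off);
      if (::pwrite(fd, zeros.data(), n, off) < 0) return false;
    }
    return ::fsync(fd) == 0;      // extents and inode size settled up front
  }
  bool append(const void* data, uint64_t len) {
    if (pos + len > size)
      pos = 0;                    // wrap around (old records assumed retired)
    if (::pwrite(fd, data, len, pos) < 0)   // data-only sync via O_DSYNC
      return false;
    pos += len;
    return true;
  }
};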

sage


> 
> newstore+8m overlay is interesting.  Lots of data gets written out to the disk
> in seemingly large chunks but the actual throughput as reported by the client
> is very slow.  I assume there's tons of write amplification happening as
> rocksdb moves the 512k objects around into different levels.
> 
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
> 
> Mark
> 
> On 04/10/2015 02:41 PM, Mark Nelson wrote:
> > Seekwatcher movies and graphs finally finished generating for all of the
> > tests:
> > 
> > http://nhm.ceph.com/newstore/20150409/
> > 
> > Mark
> > 
> > On 04/10/2015 10:53 AM, Mark Nelson wrote:
> > > Test results attached for different overlay settings at various IO sizes
> > > for writes and random writes.  Basically it looks like as we increase
> > > the overlay size it changes the curve.  So far we're still not doing as
> > > good as the filestore (co-located journal) though.
> > > 
> > > I imagine the WAL probably does play a big part here.
> > > 
> > > Mark
> > > 
> > > On 04/10/2015 10:28 AM, Sage Weil wrote:
> > > > On Fri, 10 Apr 2015, Ning Yao wrote:
> > > > > KV store introduces too much write amplification, we may need
> > > > > self-implemented WAL?
> > > > 
> > > > What we really want is to hint to the kv store that these keys (or this
> > > > key range) is short-lived and should never get compacted.  And/or, we
> > > > need
> > > > to just make sure the wal is sufficiently large so that in practice that
> > > > never happens to those keys.
> > > > 
> > > > Putting them outside the kv store means an additional seek/sync for
> > > > disks,
> > > > which defeats most of the purpose.  Maybe it makes sense for flash...
> > > > but
> > > > the above avoids the problem in either case.
> > > > 
> > > > I think we should target rocksdb for our initial tuning attempts.  So
> > > > far
> > > > all I've done is played a bit with the file size (1mb -> 4mb -> 8mb)
> > > > but my ad hoc tests didn't see much difference.
> > > > 
> > > > sage
> > > > 
> > > > 
> > > > 
> > > > > Regards
> > > > > Ning Yao
> > > > > 
> > > > > 
> > > > > 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
> > > > > > IMHO, the newstore performance depends so much on KV store
> > > > > > performance due to the WAL -  so pick up the right KV or tune it
> > > > > > will be the 1st step to do.
> > > > > > 
> > > > > > -jiangang
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > > > > > Sent: Friday, April 10, 2015 1:01 AM
> > > > > > To: Sage Weil
> > > > > > Cc: ceph-devel
> > > > > > Subject: Re: Initial newstore vs filestore results
> > > > > > 
> > > > > > On 04/08/2015 10:19 PM, Mark Nelson wrote:
> > > > > > > On 04/07/2015 09:58 PM, Sage Weil wrote:
> > > > > > > > What would be very interesting would be to see the 4KB
> > > > > > > > performance
> > > > > > > > with the defaults (newstore overlay max = 32) vs overlays
> > > > > > > > disabled
> > > > > > > > (newstore overlay max = 0) and see if/how much it is helping.
> > > > > > > 
> > > > > > > And here we go.  1 OSD, 1X replication.  16GB RBD volume.
> > > > > > > 
> > > > > > > 4MB        write    read    randw    randr
> > > > > > > default overlay    36.13    106.61    34.49    92.69
> > > > > > > no overlay    36.29    105.61    34.49    93.55
> > > > > > > 
> > > > > > > 128KB        write    read    randw    randr
> > > > > > > default overlay    1.71    97.90    1.65    25.79
> > > > > > > no overlay    1.72    97.80    1.66    25.78
> > > > > > > 
> > > > > > > 4KB        write    read    randw    randr
> > > > > > > default overlay    0.40    61.88    1.29    1.11
> > > > > > > no overlay    0.05    61.26    0.05    1.10
> > > > > > > 
> > > > > > 
> > > > > > Update this morning.  Also ran filestore tests for comparison.  Next
> > > > > > we'll look at how tweaking the overlay for different IO sizes
> > > > > > affects things.  IE the overlay threshold is 64k right now and it
> > > > > > appears that 128K write IOs for instance are quite a bit worse with
> > > > > > newstore currently than with filestore.  Sage also just committed
> > > > > > changes that will allow overlay writes during append/create which
> > > > > > may help improve small IO write performance as well in some cases.
> > > > > > 
> > > > > > 4MB             write   read    randw   randr
> > > > > > default overlay 36.13   106.61  34.49   92.69
> > > > > > no overlay      36.29   105.61  34.49   93.55
> > > > > > filestore       36.17   84.59   34.11   79.85
> > > > > > 
> > > > > > 128KB           write   read    randw   randr
> > > > > > default overlay 1.71    97.90   1.65    25.79
> > > > > > no overlay      1.72    97.80   1.66    25.78
> > > > > > filestore       27.15   79.91   8.77    19.00
> > > > > > 
> > > > > > 4KB             write   read    randw   randr
> > > > > > default overlay 0.40    61.88   1.29    1.11
> > > > > > no overlay      0.05    61.26   0.05    1.10
> > > > > > filestore       4.14    56.30   0.42    0.76
> > > > > > 
> > > > > > Seekwatcher movies and graphs available here:
> > > > > > 
> > > > > > http://nhm.ceph.com/newstore/20150408/
> > > > > > 
> > > > > > Note for instance the very interesting blktrace patterns for 4K
> > > > > > random writes on the OSD in each case:
> > > > > > 
> > > > > > http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
> > > > > > 
> > > > > > 
> > > > > > http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
> > > > > > 
> > > > > > 
> > > > > > http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Mark
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > ceph-devel" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > 
> > > > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Initial newstore vs filestore results
  2015-04-10 20:04                     ` Mark Nelson
  2015-04-10 23:24                       ` Sage Weil
@ 2015-04-10 23:43                       ` Duan, Jiangang
  2015-04-11  0:09                         ` Mark Nelson
  1 sibling, 1 reply; 28+ messages in thread
From: Duan, Jiangang @ 2015-04-10 23:43 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, Ning Yao; +Cc: ceph-devel

Mark, what is the workload pattern for the data below?  Small IO or big IO?  New files, or in-place updates in RBD?

Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk.  I would have expected to see more sequential looking writes during the peaks due to journal writes and no reads to that portion of the disk, but it seems murkier to me than that.

http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg

newstore+no_overlay does kind of a flurry of random IO and looks like
it's somewhat seek bound.  It's very consistent but actual write performance is low compared to what blktrace reports as the data hitting the disk.  Something happening toward the beginning of the drive too.

http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg

newstore+8m overlay is interesting.  Lots of data gets written out to
the disk in seemingly large chunks but the actual throughput as reported by the client is very slow.  I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.

http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Saturday, April 11, 2015 4:05 AM
To: Sage Weil; Ning Yao
Cc: Duan, Jiangang; ceph-devel
Subject: Re: Initial newstore vs filestore results

Notice for instance a comparison of random 512k writes between filestore, newstore with no overlay, and newstore with 8m overlay:

http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite.png
http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite.png
http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite.png

The client rbd throughput as reported by fio is:

filestore: 20.44MB/s
newstore+no_overlay: 4.35MB/s
newstore+8m_overlay: 3.86MB/s

But notice that in the graphs, we see very different behaviors on disk.

Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk.  I would have expected to see more sequential looking writes during the peaks due to journal writes and no reads to that portion of the disk, but it seems murkier to me than that.

http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg

newstore+no_overlay does kind of a flurry of random IO and looks like
it's somewhat seek bound.  It's very consistent but actual write performance is low compared to what blktrace reports as the data hitting the disk.  Something happening toward the beginning of the drive too.

http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg

newstore+8m overlay is interesting.  Lots of data gets written out to
the disk in seemingly large chunks but the actual throughput as reported by the client is very slow.  I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.

http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg

Mark

On 04/10/2015 02:41 PM, Mark Nelson wrote:
> Seekwatcher movies and graphs finally finished generating for all of 
> the
> tests:
>
> http://nhm.ceph.com/newstore/20150409/
>
> Mark
>
> On 04/10/2015 10:53 AM, Mark Nelson wrote:
>> Test results attached for different overlay settings at various IO 
>> sizes for writes and random writes.  Basically it looks like as we 
>> increase the overlay size it changes the curve.  So far we're still 
>> not doing as good as the filestore (co-located journal) though.
>>
>> I imagine the WAL probably does play a big part here.
>>
>> Mark
>>
>> On 04/10/2015 10:28 AM, Sage Weil wrote:
>>> On Fri, 10 Apr 2015, Ning Yao wrote:
>>>> KV store introduces too much write amplification, we may need 
>>>> self-implemented WAL?
>>>
>>> What we really want is to hint to the kv store that these keys (or 
>>> this key range) is short-lived and should never get compacted.  
>>> And/or, we need to just make sure the wal is sufficiently large so 
>>> that in practice that never happens to those keys.
>>>
>>> Putting them outside the kv store means an additional seek/sync for 
>>> disks, which defeats most of the purpose.  Maybe it makes sense for 
>>> flash...
>>> but
>>> the above avoids the problem in either case.
>>>
>>> I think we should target rocksdb for our initial tuning attempts.  
>>> So far all I've done is played a bit with the file size (1mb -> 4mb 
>>> -> 8mb) but my ad hoc tests didn't see much difference.
>>>
>>> sage
>>>
>>>
>>>
>>>> Regards
>>>> Ning Yao
>>>>
>>>>
>>>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
>>>>> IMHO, the newstore performance depends so much on KV store 
>>>>> performance due to the WAL -  so pick up the right KV or tune it 
>>>>> will be the 1st step to do.
>>>>>
>>>>> -jiangang
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org 
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>> Sent: Friday, April 10, 2015 1:01 AM
>>>>> To: Sage Weil
>>>>> Cc: ceph-devel
>>>>> Subject: Re: Initial newstore vs filestore results
>>>>>
>>>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>>>> What would be very interesting would be to see the 4KB 
>>>>>>> performance with the defaults (newstore overlay max = 32) vs 
>>>>>>> overlays disabled (newstore overlay max = 0) and see if/how much it is helping.
>>>>>>
>>>>>> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>>>>>>
>>>>>> 4MB        write    read    randw    randr
>>>>>> default overlay    36.13    106.61    34.49    92.69
>>>>>> no overlay    36.29    105.61    34.49    93.55
>>>>>>
>>>>>> 128KB        write    read    randw    randr
>>>>>> default overlay    1.71    97.90    1.65    25.79
>>>>>> no overlay    1.72    97.80    1.66    25.78
>>>>>>
>>>>>> 4KB        write    read    randw    randr
>>>>>> default overlay    0.40    61.88    1.29    1.11
>>>>>> no overlay    0.05    61.26    0.05    1.10
>>>>>>
>>>>>
>>>>> Update this morning.  Also ran filestore tests for comparison.  
>>>>> Next we'll look at how tweaking the overlay for different IO sizes 
>>>>> affects things.  IE the overlay threshold is 64k right now and it 
>>>>> appears that 128K write IOs for instance are quite a bit worse 
>>>>> with newstore currently than with filestore.  Sage also just 
>>>>> committed changes that will allow overlay writes during 
>>>>> append/create which may help improve small IO write performance as well in some cases.
>>>>>
>>>>> 4MB             write   read    randw   randr
>>>>> default overlay 36.13   106.61  34.49   92.69
>>>>> no overlay      36.29   105.61  34.49   93.55
>>>>> filestore       36.17   84.59   34.11   79.85
>>>>>
>>>>> 128KB           write   read    randw   randr
>>>>> default overlay 1.71    97.90   1.65    25.79
>>>>> no overlay      1.72    97.80   1.66    25.78
>>>>> filestore       27.15   79.91   8.77    19.00
>>>>>
>>>>> 4KB             write   read    randw   randr
>>>>> default overlay 0.40    61.88   1.29    1.11
>>>>> no overlay      0.05    61.26   0.05    1.10
>>>>> filestore       4.14    56.30   0.42    0.76
>>>>>
>>>>> Seekwatcher movies and graphs available here:
>>>>>
>>>>> http://nhm.ceph.com/newstore/20150408/
>>>>>
>>>>> Note for instance the very interesting blktrace patterns for 4K 
>>>>> random writes on the OSD in each case:
>>>>>
>>>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randw
>>>>> rite.png
>>>>>
>>>>>
>>>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096
>>>>> _randwrite.png
>>>>>
>>>>>
>>>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_rand
>>>>> write.png
>>>>>
>>>>>
>>>>>
>>>>> Mark
>>>>
>>>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Initial newstore vs filestore results
  2015-04-10 23:24                       ` Sage Weil
@ 2015-04-10 23:44                         ` Duan, Jiangang
  2015-04-10 23:58                           ` Mark Nelson
  0 siblings, 1 reply; 28+ messages in thread
From: Duan, Jiangang @ 2015-04-10 23:44 UTC (permalink / raw)
  To: Sage Weil, Mark Nelson; +Cc: Ning Yao, ceph-devel

You can try Universal Compaction
https://github.com/facebook/rocksdb/wiki/Universal-Compaction
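
For reference, switching a rocksdb instance to universal compaction through the C++ API is just an options change.  A minimal sketch (the path and buffer sizes are illustrative, not a tuned newstore configuration):

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <cassert>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Universal compaction rewrites data less aggressively than the default
  // leveled style, trading space amplification for lower write amplification.
  options.compaction_style = rocksdb::kCompactionStyleUniversal;
  options.OptimizeUniversalStyleCompaction();

  // Illustrative values only: bigger/multiple memtables keep short-lived
  // WAL-style keys in memory longer before they ever reach an SST.
  options.write_buffer_size = 64 << 20;   // 64 MB memtable
  options.max_write_buffer_number = 4;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/universal-test", &db);
  assert(s.ok());

  db->Put(rocksdb::WriteOptions(), "key", "value");
  delete db;
  return 0;
}

In newstore the equivalent change would presumably go in through however the OSD passes its rocksdb options down, rather than a standalone program like this.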



-----Original Message-----
From: Sage Weil [mailto:sage@newdream.net] 
Sent: Saturday, April 11, 2015 7:24 AM
To: Mark Nelson
Cc: Ning Yao; Duan, Jiangang; ceph-devel
Subject: Re: Initial newstore vs filestore results

On Fri, 10 Apr 2015, Mark Nelson wrote:
> Notice for instance a comparison of random 512k writes between 
> filestore, newstore with no overlay, and newstore with 8m overlay:
> 
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite
> .png 
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrit
> e.png 
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrit
> e.png
> 
> The client rbd throughput as reported by fio is:
> 
> filestore: 20.44MB/s
> newstore+no_overlay: 4.35MB/s
> newstore+8m_overlay: 3.86MB/s
> 
> But notice that in the graphs, we see very different behaviors on disk.
> 
> Filestore does a lot of reads and writes to a couple of specific 
> portions of the device and has peaks/valleys when data gets written 
> out in bulk.  I would have expected to see more sequential looking 
> writes during the peaks due to journal writes and no reads to that 
> portion of the disk, but it seems murkier to me than that.
> 
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite
> _OSD0.mpg
> 
> newstore+no_overlay does kind of a flurry of random IO and looks like it's
> somewhat seek bound.  It's very consistent but actual write 
> performance is low compared to what blktrace reports as the data 
> hitting the disk.  Something happening toward the beginning of the drive too.
> 
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrit
> e_OSD0.mpg

Yeah, looks like a bunch of write amplification... the disk bw used is really high.  I think we need to look at what rocksdb is doing here.  A couple things:

 - Make the log bigger, if we can, so that short-lived WAL keys don't get amplified.  We'd rather eat memory than rewrite them in an sst since the number of them in flight is pretty well bounded.

 - The rocksdb log as it stands isn't ever going to perform as well as the FileJournal currently does.  The FileJournal uses a fixed-size device or file that's preallocated with no 'size' associated with it, so that when there is a write we only have to push down the data blocks (one seek), and on replay can identify valid records with a seq # and checksum.  
Rocksdb's log is a .log file that grows and gets fsync(2)'d, which means that the data blocks have to hit the disk *and* the inode (size) needs to get updated for the commit to happen.  We could improve this by doing a fallocate and turning it into a circular buffer.  I'm not sure XFS will let us fallocate a fresh file of 0's though and avoid a second seek because it'll still need to flip the extent bits when the data blocks are written... or prefill the file with 0's before using it.  :/
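
To make the circular-buffer idea concrete, here's a rough sketch, purely as an illustration rather than a proposed implementation: a log file prefilled with zeros to a fixed size, records framed by a sequence number, length, and checksum so replay can find the newest valid entry, and only an fdatasync() per append since the inode size never changes.  Torn-write handling, O_DIRECT alignment, and the XFS unwritten-extent question above are all glossed over.

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <vector>
#include <stdexcept>

struct RecordHeader {
  uint64_t seq;    // monotonically increasing; replay keeps the newest valid record
  uint32_t len;    // payload length
  uint32_t csum;   // payload checksum (a real journal would use crc32c)
};

class RingLog {
 public:
  RingLog(const char* path, uint64_t size) : size_(size) {
    fd_ = ::open(path, O_RDWR | O_CREAT, 0644);
    if (fd_ < 0) throw std::runtime_error("open failed");
    // Prefill with real zeros rather than fallocate(), so the filesystem never
    // has to convert unwritten extents (extra metadata IO) on later writes.
    std::vector<char> zeros(1 << 20, 0);
    for (uint64_t off = 0; off < size_; off += zeros.size())
      if (::pwrite(fd_, zeros.data(), zeros.size(), off) < 0)
        throw std::runtime_error("prefill failed");
    ::fsync(fd_);  // one-time flush; from here on the file size is fixed
  }

  void append(const char* data, uint32_t len) {
    RecordHeader h{seq_++, len, checksum(data, len)};
    if (pos_ + sizeof(h) + len > size_) pos_ = 0;  // wrap around
    ::pwrite(fd_, &h, sizeof(h), pos_);
    ::pwrite(fd_, data, len, pos_ + sizeof(h));
    pos_ += sizeof(h) + len;
    // Only data blocks need to reach the platter; the inode size is unchanged,
    // so fdatasync() should not have to write metadata as well.
    ::fdatasync(fd_);
  }

 private:
  static uint32_t checksum(const char* p, uint32_t n) {
    uint32_t c = 0;
    for (uint32_t i = 0; i < n; ++i) c = (c << 5) + c + static_cast<uint8_t>(p[i]);
    return c;
  }
  int fd_ = -1;
  uint64_t size_, pos_ = 0, seq_ = 1;
};

Replay would scan the file, verify checksums, and keep the highest sequence number that checks out, much like the FileJournal does today.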

sage


> 
> newstore+8m overlay is interesting.  Lots of data gets written out to the disk
> in seemingly large chunks but the actual throughput as reported by the 
> client is very slow.  I assume there's tons of write amplification 
> happening as rocksdb moves the 512k objects around into different levels.
> 
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrit
> e_OSD0.mpg
> 
> Mark
> 
> On 04/10/2015 02:41 PM, Mark Nelson wrote:
> > Seekwatcher movies and graphs finally finished generating for all of 
> > the
> > tests:
> > 
> > http://nhm.ceph.com/newstore/20150409/
> > 
> > Mark
> > 
> > On 04/10/2015 10:53 AM, Mark Nelson wrote:
> > > Test results attached for different overlay settings at various IO 
> > > sizes for writes and random writes.  Basically it looks like as we 
> > > increase the overlay size it changes the curve.  So far we're 
> > > still not doing as good as the filestore (co-located journal) though.
> > > 
> > > I imagine the WAL probably does play a big part here.
> > > 
> > > Mark
> > > 
> > > On 04/10/2015 10:28 AM, Sage Weil wrote:
> > > > On Fri, 10 Apr 2015, Ning Yao wrote:
> > > > > KV store introduces too much write amplification, we may need 
> > > > > self-implemented WAL?
> > > > 
> > > > What we really want is to hint to the kv store that these keys 
> > > > (or this key range) is short-lived and should never get 
> > > > compacted.  And/or, we need to just make sure the wal is 
> > > > sufficiently large so that in practice that never happens to 
> > > > those keys.
> > > > 
> > > > Putting them outside the kv store means an additional seek/sync 
> > > > for disks, which defeats most of the purpose.  Maybe it makes 
> > > > sense for flash...
> > > > but
> > > > the above avoids the problem in either case.
> > > > 
> > > > I think we should target rocksdb for our initial tuning 
> > > > attempts.  So far all I've done is played a bit with the file 
> > > > size (1mb -> 4mb -> 8mb) but my ad hoc tests didn't see much 
> > > > difference.
> > > > 
> > > > sage
> > > > 
> > > > 
> > > > 
> > > > > Regards
> > > > > Ning Yao
> > > > > 
> > > > > 
> > > > > 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
> > > > > > IMHO, the newstore performance depends so much on KV store 
> > > > > > performance due to the WAL -  so pick up the right KV or 
> > > > > > tune it will be the 1st step to do.
> > > > > > 
> > > > > > -jiangang
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: ceph-devel-owner@vger.kernel.org 
> > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark 
> > > > > > Nelson
> > > > > > Sent: Friday, April 10, 2015 1:01 AM
> > > > > > To: Sage Weil
> > > > > > Cc: ceph-devel
> > > > > > Subject: Re: Initial newstore vs filestore results
> > > > > > 
> > > > > > On 04/08/2015 10:19 PM, Mark Nelson wrote:
> > > > > > > On 04/07/2015 09:58 PM, Sage Weil wrote:
> > > > > > > > What would be very interesting would be to see the 4KB 
> > > > > > > > performance with the defaults (newstore overlay max = 
> > > > > > > > 32) vs overlays disabled (newstore overlay max = 0) and 
> > > > > > > > see if/how much it is helping.
> > > > > > > 
> > > > > > > And here we go.  1 OSD, 1X replication.  16GB RBD volume.
> > > > > > > 
> > > > > > > 4MB        write    read    randw    randr
> > > > > > > default overlay    36.13    106.61    34.49    92.69
> > > > > > > no overlay    36.29    105.61    34.49    93.55
> > > > > > > 
> > > > > > > 128KB        write    read    randw    randr
> > > > > > > default overlay    1.71    97.90    1.65    25.79
> > > > > > > no overlay    1.72    97.80    1.66    25.78
> > > > > > > 
> > > > > > > 4KB        write    read    randw    randr
> > > > > > > default overlay    0.40    61.88    1.29    1.11
> > > > > > > no overlay    0.05    61.26    0.05    1.10
> > > > > > > 
> > > > > > 
> > > > > > Update this morning.  Also ran filestore tests for 
> > > > > > comparison.  Next we'll look at how tweaking the overlay for 
> > > > > > different IO sizes affects things.  IE the overlay threshold 
> > > > > > is 64k right now and it appears that 128K write IOs for 
> > > > > > instance are quite a bit worse with newstore currently than 
> > > > > > with filestore.  Sage also just committed changes that will 
> > > > > > allow overlay writes during append/create which may help improve small IO write performance as well in some cases.
> > > > > > 
> > > > > > 4MB             write   read    randw   randr
> > > > > > default overlay 36.13   106.61  34.49   92.69
> > > > > > no overlay      36.29   105.61  34.49   93.55
> > > > > > filestore       36.17   84.59   34.11   79.85
> > > > > > 
> > > > > > 128KB           write   read    randw   randr
> > > > > > default overlay 1.71    97.90   1.65    25.79
> > > > > > no overlay      1.72    97.80   1.66    25.78
> > > > > > filestore       27.15   79.91   8.77    19.00
> > > > > > 
> > > > > > 4KB             write   read    randw   randr
> > > > > > default overlay 0.40    61.88   1.29    1.11
> > > > > > no overlay      0.05    61.26   0.05    1.10
> > > > > > filestore       4.14    56.30   0.42    0.76
> > > > > > 
> > > > > > Seekwatcher movies and graphs available here:
> > > > > > 
> > > > > > http://nhm.ceph.com/newstore/20150408/
> > > > > > 
> > > > > > Note for instance the very interesting blktrace patterns for 
> > > > > > 4K random writes on the OSD in each case:
> > > > > > 
> > > > > > http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096
> > > > > > _randwrite.png
> > > > > > 
> > > > > > 
> > > > > > http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00
> > > > > > 004096_randwrite.png
> > > > > > 
> > > > > > 
> > > > > > http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_0000409
> > > > > > 6_randwrite.png
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Mark
> > > > > 
> > > > > 
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-10 23:44                         ` Duan, Jiangang
@ 2015-04-10 23:58                           ` Mark Nelson
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Nelson @ 2015-04-10 23:58 UTC (permalink / raw)
  To: Duan, Jiangang, Sage Weil; +Cc: Ning Yao, ceph-devel

I have some test results with universal compaction that we gathered a while 
back with joao's modbstore benchmark:

http://www.spinics.net/lists/ceph-devel/msg19685.html

More specifically this pdf has data for universal compaction:

http://nhm.ceph.com/mon-store-stress/Monitor_Store_Stress_Medium_Tests.pdf

Mark

On 04/10/2015 06:44 PM, Duan, Jiangang wrote:
> You can try Universal Compaction
> https://github.com/facebook/rocksdb/wiki/Universal-Compaction
>
>
>
> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Saturday, April 11, 2015 7:24 AM
> To: Mark Nelson
> Cc: Ning Yao; Duan, Jiangang; ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
> On Fri, 10 Apr 2015, Mark Nelson wrote:
>> Notice for instance a comparison of random 512k writes between
>> filestore, newstore with no overlay, and newstore with 8m overlay:
>>
>> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite
>> .png
>> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrit
>> e.png
>> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrit
>> e.png
>>
>> The client rbd throughput as reported by fio is:
>>
>> filestore: 20.44MB/s
>> newstore+no_overlay: 4.35MB/s
>> newstore+8m_overlay: 3.86MB/s
>>
>> But notice that in the graphs, we see very different behaviors on disk.
>>
>> Filestore does a lot of reads and writes to a couple of specific
>> portions of the device and has peaks/valleys when data gets written
>> out in bulk.  I would have expected to see more sequential looking
>> writes during the peaks due to journal writes and no reads to that
>> portion of the disk, but it seems murkier to me than that.
>>
>> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite
>> _OSD0.mpg
>>
>> newstore+no_overlay does kind of a flurry of random IO and looks like it's
>> somewhat seek bound.  It's very consistent but actual write
>> performance is low compared to what blktrace reports as the data
>> hitting the disk.  Something happening toward the beginning of the drive too.
>>
>> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrit
>> e_OSD0.mpg
>
> Yeah, looks like a bunch of write amplification... the disk bw used is really high.  I think we need to look at what rocksdb is doing here.  A couple things:
>
>   - Make the log bigger, if we can, so that short-lived WAL keys don't get amplified.  We'd rather eat memory than rewrite them in an sst since the number of them in flight is pretty well bounded.
>
>   - The rocksdb log as it stands isn't ever going to perform as well as the FileJournal currently does.  The FileJournal uses a fixed-size device or file that's preallocated with no 'size' associated with it, so that when there is a write we only have to push down the data blocks (one seek), and on replay can identify valid records with a seq # and checksum.
> Rocksdb's log is a .log file that grows and gets fsync(2)'d, which means that the data blocks have to hit the disk *and* the inode (size) needs to get updated for the commit to happen.  We could improve this by doing a fallocate and turning it into a circular buffer.  I'm not sure XFS will let us fallocate a fresh file of 0's though and avoid a second seek because it'll still need to flip the extent bits when the data blocks are written... or prefill the file with 0's before using it.  :/
>
> sage
>
>
>>
>> newstore+8m overlay is interesting.  Lots of data gets written out to the disk
>> in seemingly large chunks but the actual throughput as reported by the
>> client is very slow.  I assume there's tons of write amplification
>> happening as rocksdb moves the 512k objects around into different levels.
>>
>> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrit
>> e_OSD0.mpg
>>
>> Mark
>>
>> On 04/10/2015 02:41 PM, Mark Nelson wrote:
>>> Seekwatcher movies and graphs finally finished generating for all of
>>> the
>>> tests:
>>>
>>> http://nhm.ceph.com/newstore/20150409/
>>>
>>> Mark
>>>
>>> On 04/10/2015 10:53 AM, Mark Nelson wrote:
>>>> Test results attached for different overlay settings at various IO
>>>> sizes for writes and random writes.  Basically it looks like as we
>>>> increase the overlay size it changes the curve.  So far we're
>>>> still not doing as good as the filestore (co-located journal) though.
>>>>
>>>> I imagine the WAL probably does play a big part here.
>>>>
>>>> Mark
>>>>
>>>> On 04/10/2015 10:28 AM, Sage Weil wrote:
>>>>> On Fri, 10 Apr 2015, Ning Yao wrote:
>>>>>> KV store introduces too much write amplification, we may need
>>>>>> self-implemented WAL?
>>>>>
>>>>> What we really want is to hint to the kv store that these keys
>>>>> (or this key range) is short-lived and should never get
>>>>> compacted.  And/or, we need to just make sure the wal is
>>>>> sufficiently large so that in practice that never happens to
>>>>> those keys.
>>>>>
>>>>> Putting them outside the kv store means an additional seek/sync
>>>>> for disks, which defeats most of the purpose.  Maybe it makes
>>>>> sense for flash...
>>>>> but
>>>>> the above avoids the problem in either case.
>>>>>
>>>>> I think we should target rocksdb for our initial tuning
>>>>> attempts.  So far all I've done is played a bit with the file
>>>>> size (1mb -> 4mb -> 8mb) but my ad hoc tests didn't see much
>>>>> difference.
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>
>>>>>> Regards
>>>>>> Ning Yao
>>>>>>
>>>>>>
>>>>>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
>>>>>>> IMHO, the newstore performance depends so much on KV store
>>>>>>> performance due to the WAL -  so pick up the right KV or
>>>>>>> tune it will be the 1st step to do.
>>>>>>>
>>>>>>> -jiangang
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
>>>>>>> Nelson
>>>>>>> Sent: Friday, April 10, 2015 1:01 AM
>>>>>>> To: Sage Weil
>>>>>>> Cc: ceph-devel
>>>>>>> Subject: Re: Initial newstore vs filestore results
>>>>>>>
>>>>>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>>>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>>>>>> What would be very interesting would be to see the 4KB
>>>>>>>>> performance with the defaults (newstore overlay max =
>>>>>>>>> 32) vs overlays disabled (newstore overlay max = 0) and
>>>>>>>>> see if/how much it is helping.
>>>>>>>>
>>>>>>>> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>>>>>>>>
>>>>>>>> 4MB        write    read    randw    randr
>>>>>>>> default overlay    36.13    106.61    34.49    92.69
>>>>>>>> no overlay    36.29    105.61    34.49    93.55
>>>>>>>>
>>>>>>>> 128KB        write    read    randw    randr
>>>>>>>> default overlay    1.71    97.90    1.65    25.79
>>>>>>>> no overlay    1.72    97.80    1.66    25.78
>>>>>>>>
>>>>>>>> 4KB        write    read    randw    randr
>>>>>>>> default overlay    0.40    61.88    1.29    1.11
>>>>>>>> no overlay    0.05    61.26    0.05    1.10
>>>>>>>>
>>>>>>>
>>>>>>> Update this morning.  Also ran filestore tests for
>>>>>>> comparison.  Next we'll look at how tweaking the overlay for
>>>>>>> different IO sizes affects things.  IE the overlay threshold
>>>>>>> is 64k right now and it appears that 128K write IOs for
>>>>>>> instance are quite a bit worse with newstore currently than
>>>>>>> with filestore.  Sage also just committed changes that will
>>>>>>> allow overlay writes during append/create which may help improve small IO write performance as well in some cases.
>>>>>>>
>>>>>>> 4MB             write   read    randw   randr
>>>>>>> default overlay 36.13   106.61  34.49   92.69
>>>>>>> no overlay      36.29   105.61  34.49   93.55
>>>>>>> filestore       36.17   84.59   34.11   79.85
>>>>>>>
>>>>>>> 128KB           write   read    randw   randr
>>>>>>> default overlay 1.71    97.90   1.65    25.79
>>>>>>> no overlay      1.72    97.80   1.66    25.78
>>>>>>> filestore       27.15   79.91   8.77    19.00
>>>>>>>
>>>>>>> 4KB             write   read    randw   randr
>>>>>>> default overlay 0.40    61.88   1.29    1.11
>>>>>>> no overlay      0.05    61.26   0.05    1.10
>>>>>>> filestore       4.14    56.30   0.42    0.76
>>>>>>>
>>>>>>> Seekwatcher movies and graphs available here:
>>>>>>>
>>>>>>> http://nhm.ceph.com/newstore/20150408/
>>>>>>>
>>>>>>> Note for instance the very interesting blktrace patterns for
>>>>>>> 4K random writes on the OSD in each case:
>>>>>>>
>>>>>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096
>>>>>>> _randwrite.png
>>>>>>>
>>>>>>>
>>>>>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00
>>>>>>> 004096_randwrite.png
>>>>>>>
>>>>>>>
>>>>>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_0000409
>>>>>>> 6_randwrite.png
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Mark
>>>>>>
>>>>>>
>>
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Initial newstore vs filestore results
  2015-04-10 23:43                       ` Duan, Jiangang
@ 2015-04-11  0:09                         ` Mark Nelson
  2015-04-11 13:22                           ` Duan, Jiangang
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Nelson @ 2015-04-11  0:09 UTC (permalink / raw)
  To: Duan, Jiangang, Sage Weil, Ning Yao; +Cc: ceph-devel

Hi Jiangang,

These specific tests are 512K random writes using fio with the librbd 
engine and iodepth of 64.  RBD volumes have been pre-allocated.  There's 
no file system present.
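
For anyone wanting to poke at the same write pattern without fio, the librbd C API makes it easy to script raw-image writes directly.  A simplified, synchronous sketch (pool and image names are placeholders, and unlike the fio runs it keeps only one IO in flight rather than 64):

#include <rados/librados.h>
#include <rbd/librbd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
  rados_t cluster;
  rados_create(&cluster, "admin");          // connect as client.admin
  rados_conf_read_file(cluster, NULL);      // default ceph.conf search path
  rados_connect(cluster);

  rados_ioctx_t io;
  rados_ioctx_create(cluster, "rbd", &io);  // "rbd" pool name is a placeholder

  rbd_image_t image;
  if (rbd_open(io, "test-img", &image, NULL) < 0) {  // image name is a placeholder
    fprintf(stderr, "rbd_open failed\n");
    return 1;
  }

  uint64_t image_size = 0;
  rbd_get_size(image, &image_size);

  // 512KB writes at random block-aligned offsets, straight to the image:
  // no filesystem in the data path, just like the fio librbd workload.
  const uint64_t bs = 512 * 1024;
  std::vector<char> buf(bs, 'x');
  if (image_size > bs) {
    uint64_t blocks = image_size / bs;
    for (int i = 0; i < 1024; ++i) {
      uint64_t off = (static_cast<uint64_t>(rand()) % blocks) * bs;
      rbd_write(image, off, bs, buf.data());
    }
  }

  rbd_close(image);
  rados_ioctx_destroy(io);
  rados_shutdown(cluster);
  return 0;
}

Matching fio's 64-deep queue would mean switching to rbd_aio_write() with completions; the synchronous loop just keeps the sketch short.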

I also collected results for 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 
1024k, 2048k, and 4096k for random and sequential writes with 
different overlay sizes:

http://nhm.ceph.com/newstore/20150409/

client side performance graphs were posted earlier in the thread here:

http://marc.info/?l=ceph-devel&m=142868123431724&w=2

Mark

On 04/10/2015 06:43 PM, Duan, Jiangang wrote:
> Mark, What is the workload pattern for below data? Small IO or big IO? New file or in-place update in RBD?
>
> Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk.  I would have expected to see more sequential looking writes during the peaks due to journal writes and no reads to that portion of the disk, but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+no_overlay does kind of a flurry of random IO and looks like
> it's somewhat seek bound.  It's very consistent but actual write performance is low compared to what blktrace reports as the data hitting the disk.  Something happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+8m overlay is interesting.  Lots of data gets written out to
> the disk in seemingly large chunks but the actual throughput as reported by the client is very slow.  I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Saturday, April 11, 2015 4:05 AM
> To: Sage Weil; Ning Yao
> Cc: Duan, Jiangang; ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
> Notice for instance a comparison of random 512k writes between filestore, newstore with no overlay, and newstore with 8m overlay:
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite.png
>
> The client rbd throughput as reported by fio is:
>
> filestore: 20.44MB/s
> newstore+no_overlay: 4.35MB/s
> newstore+8m_overlay: 3.86MB/s
>
> But notice that in the graphs, we see very different behaviors on disk.
>
> Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk.  I would have expected to see more sequential looking writes during the peaks due to journal writes and no reads to that portion of the disk, but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+no_overlay does kind of a flurry of random IO and looks like
> it's somewhat seek bound.  It's very consistent but actual write performance is low compared to what blktrace reports as the data hitting the disk.  Something happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+8m overlay is interesting.  Lots of data gets written out to
> the disk in seemingly large chunks but the actual throughput as reported by the client is very slow.  I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> Mark
>
> On 04/10/2015 02:41 PM, Mark Nelson wrote:
>> Seekwatcher movies and graphs finally finished generating for all of
>> the
>> tests:
>>
>> http://nhm.ceph.com/newstore/20150409/
>>
>> Mark
>>
>> On 04/10/2015 10:53 AM, Mark Nelson wrote:
>>> Test results attached for different overlay settings at various IO
>>> sizes for writes and random writes.  Basically it looks like as we
>>> increase the overlay size it changes the curve.  So far we're still
>>> not doing as good as the filestore (co-located journal) though.
>>>
>>> I imagine the WAL probably does play a big part here.
>>>
>>> Mark
>>>
>>> On 04/10/2015 10:28 AM, Sage Weil wrote:
>>>> On Fri, 10 Apr 2015, Ning Yao wrote:
>>>>> KV store introduces too much write amplification, we may need
>>>>> self-implemented WAL?
>>>>
>>>> What we really want is to hint to the kv store that these keys (or
>>>> this key range) is short-lived and should never get compacted.
>>>> And/or, we need to just make sure the wal is sufficiently large so
>>>> that in practice that never happens to those keys.
>>>>
>>>> Putting them outside the kv store means an additional seek/sync for
>>>> disks, which defeats most of the purpose.  Maybe it makes sense for
>>>> flash...
>>>> but
>>>> the above avoids the problem in either case.
>>>>
>>>> I think we should target rocksdb for our initial tuning attempts.
>>>> So far all I've done is played a bit with the file size (1mb -> 4mb
>>>> -> 8mb) but my ad hoc tests didn't see much difference.
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>>> Regards
>>>>> Ning Yao
>>>>>
>>>>>
>>>>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
>>>>>> IMHO, the newstore performance depends so much on KV store
>>>>>> performance due to the WAL -  so pick up the right KV or tune it
>>>>>> will be the 1st step to do.
>>>>>>
>>>>>> -jiangang
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>> Sent: Friday, April 10, 2015 1:01 AM
>>>>>> To: Sage Weil
>>>>>> Cc: ceph-devel
>>>>>> Subject: Re: Initial newstore vs filestore results
>>>>>>
>>>>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>>>>> What would be very interesting would be to see the 4KB
>>>>>>>> performance with the defaults (newstore overlay max = 32) vs
>>>>>>>> overlays disabled (newstore overlay max = 0) and see if/how much it is helping.
>>>>>>>
>>>>>>> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>>>>>>>
>>>>>>> 4MB        write    read    randw    randr
>>>>>>> default overlay    36.13    106.61    34.49    92.69
>>>>>>> no overlay    36.29    105.61    34.49    93.55
>>>>>>>
>>>>>>> 128KB        write    read    randw    randr
>>>>>>> default overlay    1.71    97.90    1.65    25.79
>>>>>>> no overlay    1.72    97.80    1.66    25.78
>>>>>>>
>>>>>>> 4KB        write    read    randw    randr
>>>>>>> default overlay    0.40    61.88    1.29    1.11
>>>>>>> no overlay    0.05    61.26    0.05    1.10
>>>>>>>
>>>>>>
>>>>>> Update this morning.  Also ran filestore tests for comparison.
>>>>>> Next we'll look at how tweaking the overlay for different IO sizes
>>>>>> affects things.  IE the overlay threshold is 64k right now and it
>>>>>> appears that 128K write IOs for instance are quite a bit worse
>>>>>> with newstore currently than with filestore.  Sage also just
>>>>>> committed changes that will allow overlay writes during
>>>>>> append/create which may help improve small IO write performance as well in some cases.
>>>>>>
>>>>>> 4MB             write   read    randw   randr
>>>>>> default overlay 36.13   106.61  34.49   92.69
>>>>>> no overlay      36.29   105.61  34.49   93.55
>>>>>> filestore       36.17   84.59   34.11   79.85
>>>>>>
>>>>>> 128KB           write   read    randw   randr
>>>>>> default overlay 1.71    97.90   1.65    25.79
>>>>>> no overlay      1.72    97.80   1.66    25.78
>>>>>> filestore       27.15   79.91   8.77    19.00
>>>>>>
>>>>>> 4KB             write   read    randw   randr
>>>>>> default overlay 0.40    61.88   1.29    1.11
>>>>>> no overlay      0.05    61.26   0.05    1.10
>>>>>> filestore       4.14    56.30   0.42    0.76
>>>>>>
>>>>>> Seekwatcher movies and graphs available here:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/
>>>>>>
>>>>>> Note for instance the very interesting blktrace patterns for 4K
>>>>>> random writes on the OSD in each case:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randw
>>>>>> rite.png
>>>>>>
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096
>>>>>> _randwrite.png
>>>>>>
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_rand
>>>>>> write.png
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mark
>>>>>
>>>>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Initial newstore vs filestore results
  2015-04-11  0:09                         ` Mark Nelson
@ 2015-04-11 13:22                           ` Duan, Jiangang
  0 siblings, 0 replies; 28+ messages in thread
From: Duan, Jiangang @ 2015-04-11 13:22 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, Ning Yao; +Cc: ceph-devel

Thanks. 

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Saturday, April 11, 2015 8:09 AM
To: Duan, Jiangang; Sage Weil; Ning Yao
Cc: ceph-devel
Subject: Re: Initial newstore vs filestore results

Hi Jiangang,

These specific tests are 512K random writes using fio with the librbd engine and iodepth of 64.  RBD volumes have been pre-allocated.  There's no file system present.

I also collected results for 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1024k, 2048k, and 4096k for random and sequential writes with different overlay sizes:

http://nhm.ceph.com/newstore/20150409/

client side performance graphs were posted earlier in the thread here:

http://marc.info/?l=ceph-devel&m=142868123431724&w=2

Mark

On 04/10/2015 06:43 PM, Duan, Jiangang wrote:
> Mark, What is the workload pattern for below data? Small IO or big IO? New file or in-place update in RBD?
>
> Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk.  I would have expected to see more sequential looking writes during the peaks due to journal writes and no reads to that portion of the disk, but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite
> _OSD0.mpg
>
> newstore+no_overlay does kind of a flurry of random IO and looks like
> it's somewhat seek bound.  It's very consistent but actual write performance is low compared to what blktrace reports as the data hitting the disk.  Something happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrit
> e_OSD0.mpg
>
> newstore+8m overlay is interesting.  Lots of data gets written out to
> the disk in seemingly large chunks but the actual throughput as reported by the client is very slow.  I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrit
> e_OSD0.mpg
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Saturday, April 11, 2015 4:05 AM
> To: Sage Weil; Ning Yao
> Cc: Duan, Jiangang; ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
> Notice for instance a comparison of random 512k writes between filestore, newstore with no overlay, and newstore with 8m overlay:
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite
> .png 
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrit
> e.png 
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrit
> e.png
>
> The client rbd throughput as reported by fio is:
>
> filestore: 20.44MB/s
> newstore+no_overlay: 4.35MB/s
> newstore+8m_overlay: 3.86MB/s
>
> But notice that in the graphs, we see very different behaviors on disk.
>
> Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk.  I would have expected to see more sequential looking writes during the peaks due to journal writes and no reads to that portion of the disk, but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite
> _OSD0.mpg
>
> newstore+no_overlay does kind of a flurry of random IO and looks like
> it's somewhat seek bound.  It's very consistent but actual write performance is low compared to what blktrace reports as the data hitting the disk.  Something happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrit
> e_OSD0.mpg
>
> newstore+8m overlay is interesting.  Lots of data gets written out to
> the disk in seemingly large chunks but the actual throughput as reported by the client is very slow.  I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrit
> e_OSD0.mpg
>
> Mark
>
> On 04/10/2015 02:41 PM, Mark Nelson wrote:
>> Seekwatcher movies and graphs finally finished generating for all of 
>> the
>> tests:
>>
>> http://nhm.ceph.com/newstore/20150409/
>>
>> Mark
>>
>> On 04/10/2015 10:53 AM, Mark Nelson wrote:
>>> Test results attached for different overlay settings at various IO 
>>> sizes for writes and random writes.  Basically it looks like as we 
>>> increase the overlay size it changes the curve.  So far we're still 
>>> not doing as good as the filestore (co-located journal) though.
>>>
>>> I imagine the WAL probably does play a big part here.
>>>
>>> Mark
>>>
>>> On 04/10/2015 10:28 AM, Sage Weil wrote:
>>>> On Fri, 10 Apr 2015, Ning Yao wrote:
>>>>> KV store introduces too much write amplification, we may need 
>>>>> self-implemented WAL?
>>>>
>>>> What we really want is to hint to the kv store that these keys (or 
>>>> this key range) is short-lived and should never get compacted.
>>>> And/or, we need to just make sure the wal is sufficiently large so 
>>>> that in practice that never happens to those keys.
>>>>
>>>> Putting them outside the kv store means an additional seek/sync for 
>>>> disks, which defeats most of the purpose.  Maybe it makes sense for 
>>>> flash...
>>>> but
>>>> the above avoids the problem in either case.
>>>>
>>>> I think we should target rocksdb for our initial tuning attempts.
>>>> So far all I've done is played a bit with the file size (1mb -> 4mb
>>>> -> 8mb) but my ad hoc tests didn't see much difference.
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>>> Regards
>>>>> Ning Yao
>>>>>
>>>>>
>>>>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@intel.com>:
>>>>>> IMHO, the newstore performance depends so much on KV store 
>>>>>> performance due to the WAL -  so pick up the right KV or tune it 
>>>>>> will be the 1st step to do.
>>>>>>
>>>>>> -jiangang
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-owner@vger.kernel.org 
>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark 
>>>>>> Nelson
>>>>>> Sent: Friday, April 10, 2015 1:01 AM
>>>>>> To: Sage Weil
>>>>>> Cc: ceph-devel
>>>>>> Subject: Re: Initial newstore vs filestore results
>>>>>>
>>>>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>>>>> What would be very interesting would be to see the 4KB 
>>>>>>>> performance with the defaults (newstore overlay max = 32) vs 
>>>>>>>> overlays disabled (newstore overlay max = 0) and see if/how much it is helping.
>>>>>>>
>>>>>>> And here we go.  1 OSD, 1X replication.  16GB RBD volume.
>>>>>>>
>>>>>>> 4MB        write    read    randw    randr
>>>>>>> default overlay    36.13    106.61    34.49    92.69
>>>>>>> no overlay    36.29    105.61    34.49    93.55
>>>>>>>
>>>>>>> 128KB        write    read    randw    randr
>>>>>>> default overlay    1.71    97.90    1.65    25.79
>>>>>>> no overlay    1.72    97.80    1.66    25.78
>>>>>>>
>>>>>>> 4KB        write    read    randw    randr
>>>>>>> default overlay    0.40    61.88    1.29    1.11
>>>>>>> no overlay    0.05    61.26    0.05    1.10
>>>>>>>
>>>>>>
>>>>>> Update this morning.  Also ran filestore tests for comparison.
>>>>>> Next we'll look at how tweaking the overlay for different IO 
>>>>>> sizes affects things.  IE the overlay threshold is 64k right now 
>>>>>> and it appears that 128K write IOs for instance are quite a bit 
>>>>>> worse with newstore currently than with filestore.  Sage also 
>>>>>> just committed changes that will allow overlay writes during 
>>>>>> append/create which may help improve small IO write performance as well in some cases.
>>>>>>
>>>>>> 4MB             write   read    randw   randr
>>>>>> default overlay 36.13   106.61  34.49   92.69
>>>>>> no overlay      36.29   105.61  34.49   93.55
>>>>>> filestore       36.17   84.59   34.11   79.85
>>>>>>
>>>>>> 128KB           write   read    randw   randr
>>>>>> default overlay 1.71    97.90   1.65    25.79
>>>>>> no overlay      1.72    97.80   1.66    25.78
>>>>>> filestore       27.15   79.91   8.77    19.00
>>>>>>
>>>>>> 4KB             write   read    randw   randr
>>>>>> default overlay 0.40    61.88   1.29    1.11
>>>>>> no overlay      0.05    61.26   0.05    1.10
>>>>>> filestore       4.14    56.30   0.42    0.76
>>>>>>
>>>>>> Seekwatcher movies and graphs available here:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/
>>>>>>
>>>>>> Note for instance the very interesting blktrace patterns for 4K 
>>>>>> random writes on the OSD in each case:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_rand
>>>>>> w
>>>>>> rite.png
>>>>>>
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_0000409
>>>>>> 6
>>>>>> _randwrite.png
>>>>>>
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_ran
>>>>>> d
>>>>>> write.png
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mark
>>>>>
>>>>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2015-04-11 13:23 UTC | newest]

Thread overview: 28+ messages
2015-04-07 14:57 Initial newstore vs filestore results Mark Nelson
2015-04-07 19:16 ` Mark Nelson
2015-04-08  1:45   ` Mark Nelson
2015-04-08  1:48     ` Somnath Roy
2015-04-08  1:53       ` Mark Nelson
2015-04-08  2:26         ` Chen, Xiaoxi
2015-04-08  2:58     ` Sage Weil
2015-04-08  7:24       ` Haomai Wang
2015-04-08 16:49         ` Sage Weil
2015-04-08 17:19           ` Gregory Farnum
2015-04-08 17:38             ` Sage Weil
2015-04-08 19:16           ` Milosz Tanski
2015-04-08 14:38       ` Mark Nelson
2015-04-09  3:19       ` Mark Nelson
2015-04-09 17:00         ` Mark Nelson
2015-04-10  6:11           ` Duan, Jiangang
2015-04-10 10:25             ` Ning Yao
2015-04-10 15:28               ` Sage Weil
2015-04-10 15:53                 ` Mark Nelson
2015-04-10 19:41                   ` Mark Nelson
2015-04-10 20:04                     ` Mark Nelson
2015-04-10 23:24                       ` Sage Weil
2015-04-10 23:44                         ` Duan, Jiangang
2015-04-10 23:58                           ` Mark Nelson
2015-04-10 23:43                       ` Duan, Jiangang
2015-04-11  0:09                         ` Mark Nelson
2015-04-11 13:22                           ` Duan, Jiangang
2015-04-10 12:07             ` Mark Nelson
