* wip-librbd-caching
@ 2012-04-12 19:30 Martin Mailand
  2012-04-12 19:45 ` wip-librbd-caching Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Mailand @ 2012-04-12 19:30 UTC (permalink / raw)
  To: ceph-devel; +Cc: Josh Durgin

Hi,

today I tried the wip-librbd-caching branch. The performance improvement 
is very good, particularly for small writes.
I tested from within a vm with fio:

rbd_cache_enabled=1

fio -name iops -rw=write -size=10G -iodepth 1 -filename /tmp/bigfile 
-ioengine libaio -direct 1 -bs 4k

I get over 10k iops

With an iodepth 4 I get over 30k iops

In comparison with the rbd_writebackwindow I get around 5k iops with an 
iodepth of 1.
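
For reference, the same workload expressed as a fio job file (an 
equivalent sketch of the command line above; bump iodepth to 4 for the 
deeper run):

[iops]
rw=write
size=10G
filename=/tmp/bigfile
ioengine=libaio
direct=1
bs=4k
iodepth=1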

So far the whole cluster has been running stably for over 12 hours.

But there is also a downside.
My typical VMs are 1 GB in size, and the default cache size is 200 MB, 
which is 20% more memory usage. Maybe 50 MB or less would be enough?
I am going to test that.

The other point is that the cache is not KSM-enabled, so identical 
pages will not be merged. Could that be changed, and what would be the 
downside?

So maybe we could reduce the memory footprint of the cache but keep 
its performance.

-martin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wip-librbd-caching
  2012-04-12 19:30 wip-librbd-caching Martin Mailand
@ 2012-04-12 19:45 ` Sage Weil
  2012-04-12 19:48   ` wip-librbd-caching Damien Churchill
                     ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Sage Weil @ 2012-04-12 19:45 UTC (permalink / raw)
  To: Martin Mailand; +Cc: ceph-devel, Josh Durgin

On Thu, 12 Apr 2012, Martin Mailand wrote:
> Hi,
> 
> today I tried the wip-librbd-caching branch. The performance improvement is
> very good, particularly for small writes.
> I tested from within a vm with fio:
> 
> rbd_cache_enabled=1
> 
> fio -name iops -rw=write -size=10G -iodepth 1 -filename /tmp/bigfile -ioengine
> libaio -direct 1 -bs 4k
> 
> I get over 10k iops
> 
> With an iodepth 4 I get over 30k iops
> 
> In comparison with the rbd_writebackwindow I get around 5k iops with an
> iodepth of 1.
> 
> So far the whole cluster has been running stably for over 12 hours.

Great to hear!
 
> But there is also a downside.
> My typical VMs are 1 GB in size, and the default cache size is 200 MB, which
> is 20% more memory usage. Maybe 50 MB or less would be enough?
> I am going to test that.

The config options you'll want to look at are client_oc_* (in case you 
didn't see that already :).  "oc" is short for objectcacher, and it isn't 
only used for client (libcephfs), so it might be worth renaming these 
options before people start using them.
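
For anyone experimenting, a sketch of how these could be lowered in the 
[client] section of ceph.conf (values in bytes; the numbers here are 
illustrative, not recommendations):

[client]
    client_oc_size = 52428800          ; 50 MB cache
    client_oc_max_dirty = 26214400     ; 25 MB dirty before writes block
    client_oc_target_dirty = 8388608   ; 8 MB dirty before writeback starts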

> The other point is that the cache is not KSM-enabled, so identical pages
> will not be merged. Could that be changed, and what would be the downside?
> 
> So maybe we could reduce the memory footprint of the cache but keep its
> performance.

I'm not familiar with the performance implications of KSM, but the 
objectcacher doesn't modify existing buffers in place, so I suspect it's a 
good candidate.  And it looks like there's minimal effort in enabling 
it...

sage

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wip-librbd-caching
  2012-04-12 19:45 ` wip-librbd-caching Sage Weil
@ 2012-04-12 19:48   ` Damien Churchill
  2012-04-12 19:54   ` wip-librbd-caching Tommi Virtanen
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Damien Churchill @ 2012-04-12 19:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: Martin Mailand, ceph-devel, Josh Durgin

On 12 April 2012 20:45, Sage Weil <sage@newdream.net> wrote:
> I'm not familiar with the performance implications of KSM, but the
> objectcacher doesn't modify existing buffers in place, so I suspect it's a
> good candidate.  And it looks like there's minimal effort in enabling
> it...

It uses some CPU when calculating hashes, although I believe that if it
becomes too resource-consuming it is possible to disable it and keep
using the pages that have already been merged, just without updating
them or checking for any others that could be shared.
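
For context, ksmd itself is controlled through /sys/kernel/mm/ksm/run; a
minimal C++ sketch (the helper name is hypothetical):

#include <fstream>

// 1 starts ksmd, 0 stops it but keeps already-merged pages shared,
// 2 stops it and unmerges everything (per Documentation/vm/ksm.txt).
bool set_ksm_run(int mode) {
  std::ofstream f("/sys/kernel/mm/ksm/run");
  if (!f)
    return false;              // KSM not built in, or no permission
  f << mode;
  return static_cast<bool>(f);
}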

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wip-librbd-caching
  2012-04-12 19:45 ` wip-librbd-caching Sage Weil
  2012-04-12 19:48   ` wip-librbd-caching Damien Churchill
@ 2012-04-12 19:54   ` Tommi Virtanen
  2012-04-12 20:20     ` wip-librbd-caching Sage Weil
  2012-04-12 19:55   ` wip-librbd-caching Greg Farnum
  2012-04-18 12:50   ` wip-librbd-caching Martin Mailand
  3 siblings, 1 reply; 9+ messages in thread
From: Tommi Virtanen @ 2012-04-12 19:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: Martin Mailand, ceph-devel, Josh Durgin

On Thu, Apr 12, 2012 at 12:45, Sage Weil <sage@newdream.net> wrote:
>> So maybe we could reduce the memory footprint of the cache but keep its
>> performance.
>
> I'm not familiar with the performance implications of KSM, but the
> objectcacher doesn't modify existing buffers in place, so I suspect it's a
> good candidate.  And it looks like there's minimal effort in enabling
> it...

Are the objectcacher cache entries full pages, page aligned, with no
bookkeeping data inside the page? Those are pretty much the
requirements for page-granularity dedup to work..

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wip-librbd-caching
  2012-04-12 19:45 ` wip-librbd-caching Sage Weil
  2012-04-12 19:48   ` wip-librbd-caching Damien Churchill
  2012-04-12 19:54   ` wip-librbd-caching Tommi Virtanen
@ 2012-04-12 19:55   ` Greg Farnum
  2012-04-18 12:50   ` wip-librbd-caching Martin Mailand
  3 siblings, 0 replies; 9+ messages in thread
From: Greg Farnum @ 2012-04-12 19:55 UTC (permalink / raw)
  To: Sage Weil; +Cc: Martin Mailand, ceph-devel, Josh Durgin

On Thursday, April 12, 2012 at 12:45 PM, Sage Weil wrote:
> On Thu, 12 Apr 2012, Martin Mailand wrote:
> > The other point is that the cache is not KSM-enabled, so identical pages
> > will not be merged. Could that be changed, and what would be the downside?
> >
> > So maybe we could reduce the memory footprint of the cache but keep its
> > performance.
>  
>  
>  
> I'm not familiar with the performance implications of KSM, but the  
> objectcacher doesn't modify existing buffers in place, so I suspect it's a  
> good candidate. And it looks like there's minimal effort in enabling  
> it...


But if you're supposed to advise the kernel that the memory is a good candidate, then probably we shouldn't be making that madvise call on every buffer (I imagine it's doing a sha1 on each page and then examining a tree) — especially since we (probably) flush all that data out relatively quickly. And RBD doesn't currently have any information about whether the data is OS or user data… (I guess in future, with layering, we could call madvise on pages which were read from an underlying gold image.)
Also, TV is wondering if the data is even page-aligned or not? I can't recall off-hand.
-Greg

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wip-librbd-caching
  2012-04-12 19:54   ` wip-librbd-caching Tommi Virtanen
@ 2012-04-12 20:20     ` Sage Weil
  0 siblings, 0 replies; 9+ messages in thread
From: Sage Weil @ 2012-04-12 20:20 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Martin Mailand, ceph-devel, Josh Durgin

On Thu, 12 Apr 2012, Tommi Virtanen wrote:
> On Thu, Apr 12, 2012 at 12:45, Sage Weil <sage@newdream.net> wrote:
> >> So maybe we could reduce the memory footprint of the cache but keep its
> >> performance.
> >
> > I'm not familiar with the performance implications of KSM, but the
> > objectcacher doesn't modify existing buffers in place, so I suspect it's a
> > good candidate.  And it looks like there's minimal effort in enabling
> > it...
> 
> Are the objectcacher cache entries full pages, page aligned, with no
> bookkeeping data inside the page? Those are pretty much the
> requirements for page-granularity dedup to work..

Some buffers are, some aren't, but we'd only want to madvise on 
page-aligned ones.  The messenger is careful to read things into aligned 
memory, and librbd will only be getting block-sized (probably page-sized, 
if we say we have 4k blocks) IO... so that should include every buffer in 
this case.
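
A rough sketch of that gating in C++ (a made-up helper, not the actual
objectcacher code):

#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstddef>

// Mark a cache buffer as a KSM merge candidate, but only when it starts
// on a page boundary and covers whole pages, since KSM merges at page
// granularity.  Needs CONFIG_KSM in the kernel.
void maybe_advise_mergeable(void *buf, std::size_t len) {
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  if (len > 0 &&
      reinterpret_cast<std::uintptr_t>(buf) % page == 0 &&
      len % page == 0)
    madvise(buf, len, MADV_MERGEABLE);
}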

sage

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wip-librbd-caching
  2012-04-12 19:45 ` wip-librbd-caching Sage Weil
                     ` (2 preceding siblings ...)
  2012-04-12 19:55   ` wip-librbd-caching Greg Farnum
@ 2012-04-18 12:50   ` Martin Mailand
  2012-04-18 16:27     ` wip-librbd-caching Greg Farnum
  2012-04-18 17:44     ` wip-librbd-caching Sage Weil
  3 siblings, 2 replies; 9+ messages in thread
From: Martin Mailand @ 2012-04-18 12:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Josh Durgin

Am 12.04.2012 21:45, schrieb Sage Weil:
> The config options you'll want to look at are client_oc_* (in case you
> didn't see that already :).  "oc" is short for objectcacher, and it isn't
> only used for client (libcephfs), so it might be worth renaming these
> options before people start using them.

Hi,

I changed the values and the performance is still very good and the 
memory footprint is much smaller.

OPTION(client_oc_size, OPT_INT, 1024*1024* 50)    // MB * n
OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25)    // MB * n  (dirty OR tx.. bigish)
OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep this smallish)
// note: the max amount of "in flight" dirty data is roughly (max - target)

But I am not quite sure about the meaning of the values.
client_oc_size Max size of the cache?
client_oc_max_dirty max dirty value before the writeback starts?
client_oc_target_dirty ???


-martin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wip-librbd-caching
  2012-04-18 12:50   ` wip-librbd-caching Martin Mailand
@ 2012-04-18 16:27     ` Greg Farnum
  2012-04-18 17:44     ` wip-librbd-caching Sage Weil
  1 sibling, 0 replies; 9+ messages in thread
From: Greg Farnum @ 2012-04-18 16:27 UTC (permalink / raw)
  To: Martin Mailand; +Cc: Sage Weil, ceph-devel, Josh Durgin

On Wednesday, April 18, 2012 at 5:50 AM, Martin Mailand wrote:
> Hi,
>  
> I changed the values and the performance is still very good and the  
> memory footprint is much smaller.
>  
> OPTION(client_oc_size, OPT_INT, 1024*1024* 50) // MB * n
> OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25) // MB * n (dirty OR tx.. bigish)
> OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep this smallish)
> // note: the max amount of "in flight" dirty data is roughly (max - target)
>  
> But I am not quite sure about the meaning of the values.
> client_oc_size Max size of the cache?
> client_oc_max_dirty max dirty value before the writeback starts?
> client_oc_target_dirty ???
>  

Right now the cache writeout algorithms are based on the amount of dirty data, rather than something like how long the data has been dirty.
client_oc_size is the max (and therefore typical) size of the cache.
client_oc_max_dirty is the largest amount of dirty data allowed in the cache; if that much is already dirty and you try to dirty more, the dirtying write will block until some of the existing dirty data has been committed.
client_oc_target_dirty is the amount of dirty data that will trigger the cache to start flushing data out.
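
A hypothetical C++ sketch of how those three limits relate for a single
write (illustrative only, not the actual ObjectCacher code):

#include <cstddef>

enum class WriteAction { Proceed, StartWriteback, BlockUntilFlushed };

// dirty_bytes: dirty data already in the cache; write_len: incoming write.
WriteAction classify_write(std::size_t dirty_bytes, std::size_t write_len) {
  const std::size_t max_dirty    = 25u * 1024 * 1024; // client_oc_max_dirty
  const std::size_t target_dirty =  8u * 1024 * 1024; // client_oc_target_dirty

  if (dirty_bytes + write_len > max_dirty)
    return WriteAction::BlockUntilFlushed;  // writer waits for commits
  if (dirty_bytes + write_len > target_dirty)
    return WriteAction::StartWriteback;     // flusher kicks in, write proceeds
  return WriteAction::Proceed;              // plenty of headroom
}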

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wip-librbd-caching
  2012-04-18 12:50   ` wip-librbd-caching Martin Mailand
  2012-04-18 16:27     ` wip-librbd-caching Greg Farnum
@ 2012-04-18 17:44     ` Sage Weil
  1 sibling, 0 replies; 9+ messages in thread
From: Sage Weil @ 2012-04-18 17:44 UTC (permalink / raw)
  To: Martin Mailand; +Cc: ceph-devel, Josh Durgin

On Wed, 18 Apr 2012, Martin Mailand wrote:
> Am 12.04.2012 21:45, schrieb Sage Weil:
> > The config options you'll want to look at are client_oc_* (in case you
> > didn't see that already :).  "oc" is short for objectcacher, and it isn't
> > only used for client (libcephfs), so it might be worth renaming these
> > options before people start using them.
> 
> Hi,
> 
> I changed the values and the performance is still very good and the memory
> footprint is much smaller.
> 
> OPTION(client_oc_size, OPT_INT, 1024*1024* 50)    // MB * n
> OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25)    // MB * n  (dirty OR tx.. bigish)
> OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep this smallish)
> // note: the max amount of "in flight" dirty data is roughly (max - target)
> 
> But I am not quite sure about the meaning of the values.
> client_oc_size Max size of the cache?

yes

> client_oc_max_dirty max dirty value before the writeback starts?

before writes block and wait for writeback to bring the dirty level down

> client_oc_target_dirty ???

before writeback starts

BTW I renamed 'rbd cache enabled' -> 'rbd cache'.  I'd like to rename the 
objectcacher settings too so they aren't nested under client_ (which is 
the fs client code).

objectcacher_*?

sage

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2012-04-18 17:44 UTC | newest]

Thread overview: 9+ messages
2012-04-12 19:30 wip-librbd-caching Martin Mailand
2012-04-12 19:45 ` wip-librbd-caching Sage Weil
2012-04-12 19:48   ` wip-librbd-caching Damien Churchill
2012-04-12 19:54   ` wip-librbd-caching Tommi Virtanen
2012-04-12 20:20     ` wip-librbd-caching Sage Weil
2012-04-12 19:55   ` wip-librbd-caching Greg Farnum
2012-04-18 12:50   ` wip-librbd-caching Martin Mailand
2012-04-18 16:27     ` wip-librbd-caching Greg Farnum
2012-04-18 17:44     ` wip-librbd-caching Sage Weil
