* wip-librbd-caching
@ 2012-04-12 19:30 Martin Mailand
2012-04-12 19:45 ` wip-librbd-caching Sage Weil
0 siblings, 1 reply; 9+ messages in thread
From: Martin Mailand @ 2012-04-12 19:30 UTC (permalink / raw)
To: ceph-devel; +Cc: Josh Durgin
Hi,
today I tried the wip-librbd-caching branch. The performance improvement
is very good, particularly for small writes.
I tested from within a VM with fio:
rbd_cache_enabled=1
fio --name=iops --rw=write --size=10G --iodepth=1 --filename=/tmp/bigfile
--ioengine=libaio --direct=1 --bs=4k
I get over 10k iops
With an iodepth of 4 I get over 30k iops
In comparison, with the rbd_writeback_window I get around 5k iops at an
iodepth of 1.
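In case anyone wants to reproduce this, a minimal sketch of where the flag
can go, assuming the wip branch reads rbd_cache_enabled from the usual
[client] section of ceph.conf on the VM host:

  [client]
      rbd_cache_enabled = 1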
So far the whole cluster has been running stably for over 12 hours.
But there is also a downside.
My typical VMs are 1 GB in size, and the default cache size is 200 MB, which
is 20% more memory usage. Maybe 50 MB or less would be enough?
I am going to test that.
The other point is that the cache is not KSM-enabled, so identical
pages will not be merged. Could that be changed, and what would be
the downside?
So maybe we could reduce the memory footprint of the cache but keep
its performance.
-martin
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wip-librbd-caching
2012-04-12 19:30 wip-librbd-caching Martin Mailand
@ 2012-04-12 19:45 ` Sage Weil
2012-04-12 19:48 ` wip-librbd-caching Damien Churchill
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Sage Weil @ 2012-04-12 19:45 UTC (permalink / raw)
To: Martin Mailand; +Cc: ceph-devel, Josh Durgin
On Thu, 12 Apr 2012, Martin Mailand wrote:
> Hi,
>
> today I tried the wip-librbd-caching branch. The performance improvement is
> very good, particularly for small writes.
> I tested from within a VM with fio:
>
> rbd_cache_enabled=1
>
> fio --name=iops --rw=write --size=10G --iodepth=1 --filename=/tmp/bigfile
> --ioengine=libaio --direct=1 --bs=4k
>
> I get over 10k iops
>
> With an iodepth of 4 I get over 30k iops
>
> In comparison, with the rbd_writeback_window I get around 5k iops at an
> iodepth of 1.
>
> So far the whole cluster has been running stably for over 12 hours.
Great to hear!
> But there is also a downside.
> My typical VMs are 1 GB in size, and the default cache size is 200 MB, which
> is 20% more memory usage. Maybe 50 MB or less would be enough?
> I am going to test that.
The config options you'll want to look at are client_oc_* (in case you
didn't see that already :). "oc" is short for objectcacher, and it isn't
used only by the client (libcephfs), so it might be worth renaming these
options before people start using them.
> The other point is that the cache is not KSM-enabled, so identical pages
> will not be merged. Could that be changed, and what would be the downside?
>
> So maybe we could reduce the memory footprint of the cache but keep its
> performance.
I'm not familiar with the performance implications of KSM, but the
objectcacher doesn't modify existing buffers in place, so I suspect it's a
good candidate. And it looks like there's minimal effort in enabling
it...
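For the buffers we allocate ourselves, "enabling it" should boil down to a
single madvise() call at allocation time. A minimal illustrative sketch
(buffer_new() is a hypothetical helper, not something that exists in the
tree):

  #include <stdlib.h>     /* posix_memalign */
  #include <sys/mman.h>   /* madvise, MADV_MERGEABLE */
  #include <unistd.h>     /* sysconf */

  /* Hypothetical helper: page-aligned buffer marked as a KSM merge
   * candidate.  madvise() just fails with EINVAL on kernels built
   * without CONFIG_KSM, so it is safe to call unconditionally. */
  static void *buffer_new(size_t len)
  {
      void *p = NULL;
      size_t page = (size_t)sysconf(_SC_PAGESIZE);
      if (posix_memalign(&p, page, len) != 0)
          return NULL;
      madvise(p, len, MADV_MERGEABLE);
      return p;
  }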
sage
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wip-librbd-caching
2012-04-12 19:45 ` wip-librbd-caching Sage Weil
@ 2012-04-12 19:48 ` Damien Churchill
2012-04-12 19:54 ` wip-librbd-caching Tommi Virtanen
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Damien Churchill @ 2012-04-12 19:48 UTC (permalink / raw)
To: Sage Weil; +Cc: Martin Mailand, ceph-devel, Josh Durgin
On 12 April 2012 20:45, Sage Weil <sage@newdream.net> wrote:
> I'm not familiar with the performance implications of KSM, but the
> objectcacher doesn't modify existing buffers in place, so I suspect it's a
> good candidate. And it looks like there's minimal effort in enabling
> it...
It uses some CPU when calculating hashes, although I believe that if it
becomes too resource-intensive it is possible to disable it and keep using
the shared pages that have already been merged, just without updating them
or checking for any others that could be shared.
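For reference, the knob for that is the sysfs run file; this is standard KSM
behaviour on mainline kernels, nothing Ceph-specific:

  # start the ksmd scanner
  echo 1 > /sys/kernel/mm/ksm/run
  # stop scanning, but keep the pages that are already merged
  echo 0 > /sys/kernel/mm/ksm/run
  # stop scanning and unmerge everything
  echo 2 > /sys/kernel/mm/ksm/run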
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wip-librbd-caching
2012-04-12 19:45 ` wip-librbd-caching Sage Weil
2012-04-12 19:48 ` wip-librbd-caching Damien Churchill
@ 2012-04-12 19:54 ` Tommi Virtanen
2012-04-12 20:20 ` wip-librbd-caching Sage Weil
2012-04-12 19:55 ` wip-librbd-caching Greg Farnum
2012-04-18 12:50 ` wip-librbd-caching Martin Mailand
3 siblings, 1 reply; 9+ messages in thread
From: Tommi Virtanen @ 2012-04-12 19:54 UTC (permalink / raw)
To: Sage Weil; +Cc: Martin Mailand, ceph-devel, Josh Durgin
On Thu, Apr 12, 2012 at 12:45, Sage Weil <sage@newdream.net> wrote:
>> So maybe we could reduce the memory footprint of the cache but keep its
>> performance.
>
> I'm not familiar with the performance implications of KSM, but the
> objectcacher doesn't modify existing buffers in place, so I suspect it's a
> good candidate. And it looks like there's minimal effort in enabling
> it...
Are the objectcacher cache entries full pages, page-aligned, with no
bookkeeping data inside the page? Those are pretty much the
requirements for page-granularity dedup to work...
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wip-librbd-caching
2012-04-12 19:45 ` wip-librbd-caching Sage Weil
2012-04-12 19:48 ` wip-librbd-caching Damien Churchill
2012-04-12 19:54 ` wip-librbd-caching Tommi Virtanen
@ 2012-04-12 19:55 ` Greg Farnum
2012-04-18 12:50 ` wip-librbd-caching Martin Mailand
3 siblings, 0 replies; 9+ messages in thread
From: Greg Farnum @ 2012-04-12 19:55 UTC (permalink / raw)
To: Sage Weil; +Cc: Martin Mailand, ceph-devel, Josh Durgin
On Thursday, April 12, 2012 at 12:45 PM, Sage Weil wrote:
> On Thu, 12 Apr 2012, Martin Mailand wrote:
> > The other point is that the cache is not KSM-enabled, so identical pages
> > will not be merged. Could that be changed, and what would be the downside?
> >
> > So maybe we could reduce the memory footprint of the cache but keep its
> > performance.
>
>
>
> I'm not familiar with the performance implications of KSM, but the
> objectcacher doesn't modify existing buffers in place, so I suspect it's a
> good candidate. And it looks like there's minimal effort in enabling
> it...
But if you're supposed to advise the kernel that the memory is a good candidate, then we probably shouldn't be making that madvise call on every buffer (I imagine KSM does a sha1 of each page and then examines a tree), especially since we (probably) flush all that data out relatively quickly. And RBD doesn't currently have any information about whether the data is OS or user data... (I guess in the future, with layering, we could call madvise on pages that were read from an underlying gold image.)
Also, TV is wondering whether the data is even page-aligned; I can't recall off-hand.
-Greg
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wip-librbd-caching
2012-04-12 19:54 ` wip-librbd-caching Tommi Virtanen
@ 2012-04-12 20:20 ` Sage Weil
0 siblings, 0 replies; 9+ messages in thread
From: Sage Weil @ 2012-04-12 20:20 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Martin Mailand, ceph-devel, Josh Durgin
On Thu, 12 Apr 2012, Tommi Virtanen wrote:
> On Thu, Apr 12, 2012 at 12:45, Sage Weil <sage@newdream.net> wrote:
> >> So maybe we could reduce the memory footprint of the cache but keep its
> >> performance.
> >
> > I'm not familiar with the performance implications of KSM, but the
> > objectcacher doesn't modify existing buffers in place, so I suspect it's a
> > good candidate. And it looks like there's minimal effort in enabling
> > it...
>
> Are the objectcacher cache entries full pages, page aligned, with no
> bookkeeping data inside the page? Those are pretty much the
> requirements for page-granularity dedup to work..
Some buffers are, some aren't, but we'd only want to madvise on the
page-aligned ones. The messenger is careful to read things into aligned
memory, and librbd will only be getting block-sized (probably page-sized,
if we say we have 4k blocks) IO... so that should include every buffer in
this case.
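Illustratively (hypothetical helper, not existing code), the advise call
would just be gated on an alignment check:

  #include <stddef.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Only advise KSM about buffers that start on a page boundary and
   * cover whole pages; anything else is left alone. */
  static void maybe_mark_mergeable(void *p, size_t len)
  {
      size_t page = (size_t)sysconf(_SC_PAGESIZE);
      if (((uintptr_t)p % page) == 0 && (len % page) == 0)
          madvise(p, len, MADV_MERGEABLE);
  }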
sage
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wip-librbd-caching
2012-04-12 19:45 ` wip-librbd-caching Sage Weil
` (2 preceding siblings ...)
2012-04-12 19:55 ` wip-librbd-caching Greg Farnum
@ 2012-04-18 12:50 ` Martin Mailand
2012-04-18 16:27 ` wip-librbd-caching Greg Farnum
2012-04-18 17:44 ` wip-librbd-caching Sage Weil
3 siblings, 2 replies; 9+ messages in thread
From: Martin Mailand @ 2012-04-18 12:50 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel, Josh Durgin
Am 12.04.2012 21:45, schrieb Sage Weil:
> The config options you'll want to look at are client_oc_* (in case you
> didn't see that already :). "oc" is short for objectcacher, and it isn't
> used only by the client (libcephfs), so it might be worth renaming these
> options before people start using them.
Hi,
I changed the values and the performance is still very good and the
memory footprint is much smaller.
OPTION(client_oc_size, OPT_INT, 1024*1024* 50) // MB * n
OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25) // MB * n (dirty
OR tx.. bigish)
OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty
(keep this smallish)
// note: the max amount of "in flight" dirty data is roughly (max - target)
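For anyone who would rather not rebuild, the same values can presumably be
set in ceph.conf instead of editing the defaults, assuming the option names
map over directly:

  [client]
      client_oc_size = 52428800          ; 50 MB
      client_oc_max_dirty = 26214400     ; 25 MB
      client_oc_target_dirty = 8388608   ;  8 MB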
But I am not quite sure about the meaning of the values.
client_oc_size Max size of the cache?
client_oc_max_dirty max dirty value before the writeback starts?
client_oc_target_dirty ???
-martin
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wip-librbd-caching
2012-04-18 12:50 ` wip-librbd-caching Martin Mailand
@ 2012-04-18 16:27 ` Greg Farnum
2012-04-18 17:44 ` wip-librbd-caching Sage Weil
1 sibling, 0 replies; 9+ messages in thread
From: Greg Farnum @ 2012-04-18 16:27 UTC (permalink / raw)
To: Martin Mailand; +Cc: Sage Weil, ceph-devel, Josh Durgin
On Wednesday, April 18, 2012 at 5:50 AM, Martin Mailand wrote:
> Hi,
>
> I changed the values and the performance is still very good and the
> memory footprint is much smaller.
>
> OPTION(client_oc_size, OPT_INT, 1024*1024* 50) // MB * n
> OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25) // MB * n (dirty
> OR tx.. bigish)
> OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty
> (keep this smallish)
> // note: the max amount of "in flight" dirty data is roughly (max - target)
>
> But I am not quite sure about the meaning of the values.
> client_oc_size Max size of the cache?
> client_oc_max_dirty max dirty value before the writeback starts?
> client_oc_target_dirty ???
>
Right now the cache writeout algorithms are based on the amount of dirty data, rather than something like how long the data has been dirty.
client_oc_size is the max (and therefore typical) size of the cache.
client_oc_max_dirty is the largest amount of dirty data allowed in the cache: if this much is dirty and you try to dirty more, the dirtier (a write of some kind) will block until some of the other dirty data has been committed.
client_oc_target_dirty is the amount of dirty data that will trigger the cache to start flushing data out.
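In pseudocode the interaction between the two thresholds looks roughly like
this (illustrative names and stand-in stubs, not the actual ObjectCacher
code):

  #include <cstdint>

  struct CacheSketch {
      uint64_t dirty_bytes = 0;
      uint64_t max_dirty = 25 << 20;     // client_oc_max_dirty
      uint64_t target_dirty = 8 << 20;   // client_oc_target_dirty

      void write(uint64_t len) {
          while (dirty_bytes + len > max_dirty)
              wait_for_flush();          // the dirtier blocks here
          dirty_bytes += len;
          if (dirty_bytes > target_dirty)
              start_writeback();         // background flush kicks in
      }

      // Stand-in: the real flusher drains dirty data back toward the target.
      void wait_for_flush()  { dirty_bytes = target_dirty; }
      // Stand-in: the real cache starts asynchronous writeback here.
      void start_writeback() { }
  };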
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: wip-librbd-caching
2012-04-18 12:50 ` wip-librbd-caching Martin Mailand
2012-04-18 16:27 ` wip-librbd-caching Greg Farnum
@ 2012-04-18 17:44 ` Sage Weil
1 sibling, 0 replies; 9+ messages in thread
From: Sage Weil @ 2012-04-18 17:44 UTC (permalink / raw)
To: Martin Mailand; +Cc: ceph-devel, Josh Durgin
On Wed, 18 Apr 2012, Martin Mailand wrote:
> Am 12.04.2012 21:45, schrieb Sage Weil:
> > The config options you'll want to look at are client_oc_* (in case you
> > didn't see that already :). "oc" is short for objectcacher, and it isn't
> > used only by the client (libcephfs), so it might be worth renaming these
> > options before people start using them.
>
> Hi,
>
> I changed the values and the performance is still very good and the memory
> footprint is much smaller.
>
> OPTION(client_oc_size, OPT_INT, 1024*1024* 50) // MB * n
> OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25) // MB * n (dirty OR
> tx.. bigish)
> OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep
> this smallish)
> // note: the max amount of "in flight" dirty data is roughly (max - target)
>
> But I am not quite sure about the meaning of the values.
> client_oc_size Max size of the cache?
yes
> client_oc_max_dirty max dirty value before the writeback starts?
before writes block and wait for writeback to bring the dirty level down
> client_oc_target_dirty ???
before writeback starts
BTW I renamed 'rbd cache enabled' -> 'rbd cache'. I'd like to rename the
objectcacher settings too so they aren't nested under client_ (which is
the fs client code).
objectcacher_*?
sage
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2012-04-18 17:44 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-12 19:30 wip-librbd-caching Martin Mailand
2012-04-12 19:45 ` wip-librbd-caching Sage Weil
2012-04-12 19:48 ` wip-librbd-caching Damien Churchill
2012-04-12 19:54 ` wip-librbd-caching Tommi Virtanen
2012-04-12 20:20 ` wip-librbd-caching Sage Weil
2012-04-12 19:55 ` wip-librbd-caching Greg Farnum
2012-04-18 12:50 ` wip-librbd-caching Martin Mailand
2012-04-18 16:27 ` wip-librbd-caching Greg Farnum
2012-04-18 17:44 ` wip-librbd-caching Sage Weil