* [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
@ 2014-06-05  7:01 Haomai Wang
  2014-06-05  7:25 ` Wido den Hollander
  2014-06-10  1:16 ` Josh Durgin
  0 siblings, 2 replies; 10+ messages in thread
From: Haomai Wang @ 2014-06-05  7:01 UTC (permalink / raw)
  To: Sage Weil, Josh Durgin; +Cc: ceph-devel

Hi,
Previously I sent a mail about the difficulty of computing rbd
snapshot size statistics. The main solution is to use an object map
to record changes; the problem is that we can't handle concurrent
modification by multiple clients.

The lack of an object map (like the pointer map in qcow2) causes many
problems in librbd, such as with clone depth: a deep clone chain adds
noticeable latency, and each level of cloning roughly doubles it.

I propose a tradeoff between multi-client and single-client support
in librbd. In practice, most volumes/images are used by VMs, where
only one client accesses or modifies the image. We shouldn't make
shared images possible at the cost of making the common case slow.
So we can add a new flag called "shared" at image creation time. If
"shared" is false, librbd will maintain an object map for each image.
The object map is meant to be durable: each image_close call stores
the map into RADOS. If a client crashes before it can write the
object map back, the next client to open the image treats the map as
out of date and resets it.
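
To make the lifecycle concrete, here is a minimal C++ sketch of the
open/close logic described above. All of the names here (ObjectMap,
load_from_rados, save_to_rados) are illustrative stand-ins, not
existing librbd API:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    enum class ObjectState : uint8_t { UNKNOWN, EXISTS, NONEXISTENT };

    struct ObjectMap {
      bool clean = false;               // persisted together with the states
      std::vector<ObjectState> states;  // one entry per backing RADOS object
    };

    void load_from_rados(ObjectMap &m) { /* stub: read the map object */ }
    void save_to_rados(const ObjectMap &m) { /* stub: write the map object */ }

    // Image open: only trust a map that was written by a clean close.
    void open_object_map(ObjectMap &m, bool shared) {
      if (shared)
        return;                         // shared image: no object map kept
      load_from_rados(m);
      if (!m.clean)                     // previous client crashed mid-run
        std::fill(m.states.begin(), m.states.end(), ObjectState::UNKNOWN);
      m.clean = false;                  // mark in-use until a clean close
      save_to_rados(m);
    }

    // Image close: persist the map and mark it clean for the next opener.
    void close_object_map(ObjectMap &m) {
      m.clean = true;
      save_to_rados(m);
    }

The key invariant is that "clean" is only ever true for a map written
out by close_object_map, so a crashed client can never leave behind a
stale map that looks trustworthy.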

The advantages of this feature are easy to see:
1. It avoids the clone performance problem.
2. It makes snapshot size statistics possible.
3. It improves the performance of librbd operations, including reads
and copy-on-write.

What do you think? Feedback is appreciated!

-- 
Best Regards,

Wheat


* Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
  2014-06-05  7:01 [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose Haomai Wang
@ 2014-06-05  7:25 ` Wido den Hollander
  2014-06-05  7:43   ` Haomai Wang
  2014-06-10  1:16 ` Josh Durgin
  1 sibling, 1 reply; 10+ messages in thread
From: Wido den Hollander @ 2014-06-05  7:25 UTC (permalink / raw)
  To: Haomai Wang, Sage Weil, Josh Durgin; +Cc: ceph-devel

On 06/05/2014 09:01 AM, Haomai Wang wrote:
> Hi,
> Previously I sent a mail about the difficulty of computing rbd
> snapshot size statistics. The main solution is to use an object map
> to record changes; the problem is that we can't handle concurrent
> modification by multiple clients.
>
> The lack of an object map (like the pointer map in qcow2) causes many
> problems in librbd, such as with clone depth: a deep clone chain adds
> noticeable latency, and each level of cloning roughly doubles it.
>
> I propose a tradeoff between multi-client and single-client support
> in librbd. In practice, most volumes/images are used by VMs, where
> only one client accesses or modifies the image. We shouldn't make
> shared images possible at the cost of making the common case slow.
> So we can add a new flag called "shared" at image creation time. If
> "shared" is false, librbd will maintain an object map for each image.
> The object map is meant to be durable: each image_close call stores
> the map into RADOS. If a client crashes before it can write the
> object map back, the next client to open the image treats the map as
> out of date and resets it.

Why not flush the object map out every X period? Assume a client runs
for weeks or months; you would keep that map in memory the whole
time, since the image is never closed.
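
As an illustration, a periodic flusher along those lines could be as
simple as the sketch below (MapFlusher is hypothetical; the flush
callback would persist the dirty map state to RADOS):

    #include <atomic>
    #include <chrono>
    #include <functional>
    #include <thread>

    class MapFlusher {
      std::atomic<bool> stop_{false};
      std::thread thread_;
    public:
      // Persist the object map every `period`, so a crash loses at most
      // one period's worth of updates instead of the whole run's.
      MapFlusher(std::function<void()> flush, std::chrono::seconds period)
          : thread_([this, flush, period] {
              while (!stop_) {
                std::this_thread::sleep_for(period);
                if (!stop_)
                  flush();              // e.g. write the dirty map to RADOS
              }
            }) {}
      ~MapFlusher() {
        stop_ = true;                   // destructor waits at most one period
        thread_.join();
      }
    };

A client would construct one of these alongside the open image, e.g.
MapFlusher f([&]{ save_to_rados(map); }, std::chrono::seconds(30)); a
crash then loses at most 30 seconds of map updates.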

>
> The advantages of this feature are easy to see:
> 1. It avoids the clone performance problem.
> 2. It makes snapshot size statistics possible.
> 3. It improves the performance of librbd operations, including reads
> and copy-on-write.
>
> What do you think? Feedback is appreciated!
>


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


* Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
  2014-06-05  7:25 ` Wido den Hollander
@ 2014-06-05  7:43   ` Haomai Wang
  2014-06-05 13:55     ` Allen Samuels
  0 siblings, 1 reply; 10+ messages in thread
From: Haomai Wang @ 2014-06-05  7:43 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Sage Weil, Josh Durgin, ceph-devel

On Thu, Jun 5, 2014 at 3:25 PM, Wido den Hollander <wido@42on.com> wrote:
> On 06/05/2014 09:01 AM, Haomai Wang wrote:
>>
>> Hi,
>> Previously I sent a mail about the difficulty of computing rbd
>> snapshot size statistics. The main solution is to use an object map
>> to record changes; the problem is that we can't handle concurrent
>> modification by multiple clients.
>>
>> The lack of an object map (like the pointer map in qcow2) causes many
>> problems in librbd, such as with clone depth: a deep clone chain adds
>> noticeable latency, and each level of cloning roughly doubles it.
>>
>> I propose a tradeoff between multi-client and single-client support
>> in librbd. In practice, most volumes/images are used by VMs, where
>> only one client accesses or modifies the image. We shouldn't make
>> shared images possible at the cost of making the common case slow.
>> So we can add a new flag called "shared" at image creation time. If
>> "shared" is false, librbd will maintain an object map for each image.
>> The object map is meant to be durable: each image_close call stores
>> the map into RADOS. If a client crashes before it can write the
>> object map back, the next client to open the image treats the map as
>> out of date and resets it.
>
>
> Why not flush the object map out every X period? Assume a client runs
> for weeks or months; you would keep that map in memory the whole
> time, since the image is never closed.

Yes, a periodic flush job would also be a good alternative.

>
>
>>
>> The advantages of this feature are easy to see:
>> 1. It avoids the clone performance problem.
>> 2. It makes snapshot size statistics possible.
>> 3. It improves the performance of librbd operations, including reads
>> and copy-on-write.
>>
>> What do you think? Feedback is appreciated!
>>
>
>
> --
> Wido den Hollander
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on



-- 
Best Regards,

Wheat


* RE: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
  2014-06-05  7:43   ` Haomai Wang
@ 2014-06-05 13:55     ` Allen Samuels
  2014-06-05 14:40       ` Haomai Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Allen Samuels @ 2014-06-05 13:55 UTC (permalink / raw)
  To: Haomai Wang, Wido den Hollander; +Cc: Sage Weil, Josh Durgin, ceph-devel

You talk about resetting the object map on a restart after a crash -- I assume you mean rebuilding. How long will this take?


-----------------------------------------------------------
The true mystery of the world is the visible, not the invisible.
 Oscar Wilde (1854 - 1900)

Allen Samuels
Chief Software Architect, Emerging Storage Solutions

951 SanDisk Drive, Milpitas, CA 95035
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Thursday, June 05, 2014 12:43 AM
To: Wido den Hollander
Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org
Subject: Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose

On Thu, Jun 5, 2014 at 3:25 PM, Wido den Hollander <wido@42on.com> wrote:
> On 06/05/2014 09:01 AM, Haomai Wang wrote:
>>
>> Hi,
>> Previously I sent a mail about the difficulty of computing rbd
>> snapshot size statistics. The main solution is to use an object map
>> to record changes; the problem is that we can't handle concurrent
>> modification by multiple clients.
>>
>> The lack of an object map (like the pointer map in qcow2) causes many
>> problems in librbd, such as with clone depth: a deep clone chain adds
>> noticeable latency, and each level of cloning roughly doubles it.
>>
>> I propose a tradeoff between multi-client and single-client support
>> in librbd. In practice, most volumes/images are used by VMs, where
>> only one client accesses or modifies the image. We shouldn't make
>> shared images possible at the cost of making the common case slow.
>> So we can add a new flag called "shared" at image creation time. If
>> "shared" is false, librbd will maintain an object map for each image.
>> The object map is meant to be durable: each image_close call stores
>> the map into RADOS. If a client crashes before it can write the
>> object map back, the next client to open the image treats the map as
>> out of date and resets it.
>
>
> Why not flush the object map out every X period? Assume a client runs
> for weeks or months; you would keep that map in memory the whole
> time, since the image is never closed.

Yes, a periodic flush job would also be a good alternative.

>
>
>>
>> The advantages of this feature are easy to see:
>> 1. It avoids the clone performance problem.
>> 2. It makes snapshot size statistics possible.
>> 3. It improves the performance of librbd operations, including reads
>> and copy-on-write.
>>
>> What do you think? Feedback is appreciated!
>>
>
>
> --
> Wido den Hollander
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on



--
Best Regards,

Wheat



* Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
  2014-06-05 13:55     ` Allen Samuels
@ 2014-06-05 14:40       ` Haomai Wang
  0 siblings, 0 replies; 10+ messages in thread
From: Haomai Wang @ 2014-06-05 14:40 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Wido den Hollander, Sage Weil, Josh Durgin, ceph-devel

On Thu, Jun 5, 2014 at 9:55 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
> You talk about resetting the object map on a restart after a crash -- I assume you mean rebuilding. How long will this take?

The object map can be regarded as a state cache. After a crash, every
object's state in the map becomes "unknown"; an object's state is only
updated when a client next accesses that object. So the object map is
not rebuilt when the image is opened -- the reset only affects runtime
behavior.
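
Continuing the hypothetical ObjectMap/ObjectState types sketched
earlier in the thread, the lazy repair could look like this
(probe_object_exists stands in for a per-object stat round-trip):

    #include <cstddef>

    bool probe_object_exists(size_t objno) {
      return false;                     /* stub: a rados stat on the object */
    }

    // Resolve one entry on first access after a crash; opening the image
    // stays O(1) because nothing is rebuilt up front.
    ObjectState resolve_state(ObjectMap &m, size_t objno) {
      ObjectState &st = m.states[objno];
      if (st == ObjectState::UNKNOWN)   // only objects actually touched pay
        st = probe_object_exists(objno) ? ObjectState::EXISTS
                                        : ObjectState::NONEXISTENT;
      return st;
    }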

>
>
> -----------------------------------------------------------
> The true mystery of the world is the visible, not the invisible.
>  Oscar Wilde (1854 - 1900)
>
> Allen Samuels
> Chief Software Architect, Emerging Storage Solutions
>
> 951 SanDisk Drive, Milpitas, CA 95035
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Thursday, June 05, 2014 12:43 AM
> To: Wido den Hollander
> Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org
> Subject: Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
>
> On Thu, Jun 5, 2014 at 3:25 PM, Wido den Hollander <wido@42on.com> wrote:
>> On 06/05/2014 09:01 AM, Haomai Wang wrote:
>>>
>>> Hi,
>>> Previously I sent a mail about the difficulty of computing rbd
>>> snapshot size statistics. The main solution is to use an object map
>>> to record changes; the problem is that we can't handle concurrent
>>> modification by multiple clients.
>>>
>>> The lack of an object map (like the pointer map in qcow2) causes many
>>> problems in librbd, such as with clone depth: a deep clone chain adds
>>> noticeable latency, and each level of cloning roughly doubles it.
>>>
>>> I propose a tradeoff between multi-client and single-client support
>>> in librbd. In practice, most volumes/images are used by VMs, where
>>> only one client accesses or modifies the image. We shouldn't make
>>> shared images possible at the cost of making the common case slow.
>>> So we can add a new flag called "shared" at image creation time. If
>>> "shared" is false, librbd will maintain an object map for each image.
>>> The object map is meant to be durable: each image_close call stores
>>> the map into RADOS. If a client crashes before it can write the
>>> object map back, the next client to open the image treats the map as
>>> out of date and resets it.
>>
>>
>> Why not flush the object map out every X period? Assume a client runs
>> for weeks or months; you would keep that map in memory the whole
>> time, since the image is never closed.
>
> Yes, a periodic flush job would also be a good alternative.
>
>>
>>
>>>
>>> The advantages of this feature are easy to see:
>>> 1. It avoids the clone performance problem.
>>> 2. It makes snapshot size statistics possible.
>>> 3. It improves the performance of librbd operations, including reads
>>> and copy-on-write.
>>>
>>> What do you think? Feedback is appreciated!
>>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat


* Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
  2014-06-05  7:01 [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose Haomai Wang
  2014-06-05  7:25 ` Wido den Hollander
@ 2014-06-10  1:16 ` Josh Durgin
  2014-06-10  6:52   ` Haomai Wang
  1 sibling, 1 reply; 10+ messages in thread
From: Josh Durgin @ 2014-06-10  1:16 UTC (permalink / raw)
  To: Haomai Wang, Sage Weil; +Cc: ceph-devel

On 06/05/2014 12:01 AM, Haomai Wang wrote:
> Hi,
> Previously I sent a mail about the difficulty of computing rbd
> snapshot size statistics. The main solution is to use an object map
> to record changes; the problem is that we can't handle concurrent
> modification by multiple clients.
>
> The lack of an object map (like the pointer map in qcow2) causes many
> problems in librbd, such as with clone depth: a deep clone chain adds
> noticeable latency, and each level of cloning roughly doubles it.
>
> I propose a tradeoff between multi-client and single-client support
> in librbd. In practice, most volumes/images are used by VMs, where
> only one client accesses or modifies the image. We shouldn't make
> shared images possible at the cost of making the common case slow.
> So we can add a new flag called "shared" at image creation time. If
> "shared" is false, librbd will maintain an object map for each image.
> The object map is meant to be durable: each image_close call stores
> the map into RADOS. If a client crashes before it can write the
> object map back, the next client to open the image treats the map as
> out of date and resets it.
>
> The advantages of this feature are easy to see:
> 1. It avoids the clone performance problem.
> 2. It makes snapshot size statistics possible.
> 3. It improves the performance of librbd operations, including reads
> and copy-on-write.
>
> What do you think? Feedback is appreciated!

I think it's a great idea! We discussed this a little at the last cds
[1]. I like the idea of the shared flag on an image. Since the vastly
more common case is single-client, I'd go further and suggest that
we treat images as if shared is false by default if the flag is not
present (perhaps with a config option to change this default behavior).

That way existing images can benefit from the feature without extra
configuration. There can be an rbd command to toggle the shared flag as
well, so users of ocfs2 or gfs2 or other multi-client-writing systems
can upgrade and set shared to true before restarting their clients.

Another thing to consider is the granularity of the object map. The
coarse granularity of a bitmap of object existence would be simplest,
and most useful for in-memory comparison for clones. For statistics
it might be desirable in the future to have a finer-grained index of
data existence in the image. To make that easy to handle, the on-disk
format could be a list of extents (byte ranges).
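
As a sketch of what such an extent-based format might look like on
disk -- the encoding below is purely illustrative, not a proposed RBD
wire format:

    #include <cstdint>
    #include <vector>

    struct Extent { uint64_t offset, length; };  // byte range holding data

    // Serialize a sorted extent list as little-endian u64s: a count, then
    // (offset, length) pairs. Snapshot usage then falls out as the sum of
    // the lengths.
    std::vector<uint8_t> encode_extents(const std::vector<Extent> &ex) {
      std::vector<uint8_t> out;
      auto put64 = [&out](uint64_t v) {
        for (int i = 0; i < 8; ++i)
          out.push_back(uint8_t(v >> (8 * i)));
      };
      put64(ex.size());
      for (const Extent &e : ex) { put64(e.offset); put64(e.length); }
      return out;
    }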

Another potential use case would be a mode in which the index is
treated as authoritative. This could make discard very fast, for
example. I'm not sure it could be done safely with only binary
'exists/does not exist' information though - a third 'unknown' state
might be needed for some cases. If this kind of index is actually useful
(I'm not sure there are cases where the performance penalty would be
worth it), we could add a new index format if we need it.
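
A sketch of the discard fast path an authoritative index would enable,
reusing the hypothetical three-state map from earlier in the thread
(issue_object_delete stands in for an async RADOS delete):

    #include <cstddef>

    void issue_object_delete(size_t objno) { /* stub: async RADOS delete */ }

    // Objects already known to be absent need no RADOS op at all; UNKNOWN
    // entries (e.g. after a crash) must still be deleted to be safe.
    size_t discard_objects(ObjectMap &m, size_t first, size_t last) {
      size_t ops = 0;
      for (size_t o = first; o <= last; ++o) {
        if (m.states[o] == ObjectState::NONEXISTENT)
          continue;                     // trusted index: nothing to delete
        issue_object_delete(o);
        m.states[o] = ObjectState::NONEXISTENT;
        ++ops;
      }
      return ops;                       // RADOS ops actually issued
    }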

Back to the currently proposed design, to be safe with live migration
we'd need to make sure the index is consistent in the destination
process. Using rados_notify() after we set the clean flag on the index
can make the destination vm re-read the index before any I/O
happens. This might be a good time to introduce a data payload to the
notify as well, so we can only re-read the index, instead of all the
header metadata. Rereading the index after cache invalidation and wiring
that up through qemu's bdrv_invalidate() would be even better.
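
For the notify step, a rough sketch against the librados C++ API
(signatures are approximate for the API of this era, and reload_index
is a hypothetical librbd-side helper):

    #include <rados/librados.hpp>

    void reload_index(librados::IoCtx &io) { /* stub: re-read index object */ }

    // Destination side: watch the header object, and re-read only the
    // index when the source notifies after setting the clean flag.
    struct IndexWatcher : public librados::WatchCtx {
      librados::IoCtx &io;
      explicit IndexWatcher(librados::IoCtx &io_) : io(io_) {}
      void notify(uint8_t opcode, uint64_t ver, librados::bufferlist &bl) {
        reload_index(io);               // a payload in bl could say what changed
      }
    };

After persisting the index and setting the clean flag, the source side
would then send something like ioctx.notify(header_oid, 0, bl) to wake
the watcher.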

There's more to consider in implementing this wrt snapshots, but this
email has gone on long enough.

Josh

[1] http://pad.ceph.com/p/cdsgiant-rbd-copy-on-read-for-clones


* Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
  2014-06-10  1:16 ` Josh Durgin
@ 2014-06-10  6:52   ` Haomai Wang
  2014-06-10 19:38     ` Josh Durgin
  0 siblings, 1 reply; 10+ messages in thread
From: Haomai Wang @ 2014-06-10  6:52 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Sage Weil, ceph-devel

Thanks, Josh!

Your points are really helpful. Maybe we can schedule this blueprint
for the upcoming CDS? I hope the implementation can have a significant
performance impact on librbd.



On Tue, Jun 10, 2014 at 9:16 AM, Josh Durgin <josh.durgin@inktank.com> wrote:
> On 06/05/2014 12:01 AM, Haomai Wang wrote:
>> Hi,
>> Previously I sent a mail about the difficulty of computing rbd
>> snapshot size statistics. The main solution is to use an object map
>> to record changes; the problem is that we can't handle concurrent
>> modification by multiple clients.
>>
>> The lack of an object map (like the pointer map in qcow2) causes many
>> problems in librbd, such as with clone depth: a deep clone chain adds
>> noticeable latency, and each level of cloning roughly doubles it.
>>
>> I propose a tradeoff between multi-client and single-client support
>> in librbd. In practice, most volumes/images are used by VMs, where
>> only one client accesses or modifies the image. We shouldn't make
>> shared images possible at the cost of making the common case slow.
>> So we can add a new flag called "shared" at image creation time. If
>> "shared" is false, librbd will maintain an object map for each image.
>> The object map is meant to be durable: each image_close call stores
>> the map into RADOS. If a client crashes before it can write the
>> object map back, the next client to open the image treats the map as
>> out of date and resets it.
>>
>> The advantages of this feature are easy to see:
>> 1. It avoids the clone performance problem.
>> 2. It makes snapshot size statistics possible.
>> 3. It improves the performance of librbd operations, including reads
>> and copy-on-write.
>>
>> What do you think? Feedback is appreciated!
>
> I think it's a great idea! We discussed this a little at the last cds
> [1]. I like the idea of the shared flag on an image. Since the vastly
> more common case is single-client, I'd go further and suggest that
> we treat images as if shared is false by default if the flag is not
> present (perhaps with a config option to change this default behavior).
>
> That way existing images can benefit from the feature without extra
> configuration. There can be an rbd command to toggle the shared flag as
> well, so users of ocfs2 or gfs2 or other multi-client-writing systems
> can upgrade and set shared to true before restarting their clients.
>
> Another thing to consider is the granularity of the object map. The
> coarse granularity of a bitmap of object existence would be simplest,
> and most useful for in-memory comparison for clones. For statistics
> it might be desirable in the future to have a finer-grained index of
> data existence in the image. To make that easy to handle, the on-disk
> format could be a list of extents (byte ranges).
>
> Another potential use case would be a mode in which the index is
> treated as authoritative. This could make discard very fast, for
> example. I'm not sure it could be done safely with only binary
> 'exists/does not exist' information though - a third 'unknown' state
> might be needed for some cases. If this kind of index is actually useful
> (I'm not sure there are cases where the performance penalty would be
> worth it), we could add a new index format if we need it.
>
> Back to the currently proposed design, to be safe with live migration
> we'd need to make sure the index is consistent in the destination
> process. Using rados_notify() after we set the clean flag on the index
> can make the destination vm re-read the index before any I/O
> happens. This might be a good time to introduce a data payload to the
> notify as well, so we can only re-read the index, instead of all the
> header metadata. Rereading the index after cache invalidation and wiring
> that up through qemu's bdrv_invalidate() would be even better.
>
> There's more to consider in implementing this wrt snapshots, but this
> email has gone on long enough.
>
> Josh
>
> [1] http://pad.ceph.com/p/cdsgiant-rbd-copy-on-read-for-clones



-- 
Best Regards,

Wheat


* Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
  2014-06-10  6:52   ` Haomai Wang
@ 2014-06-10 19:38     ` Josh Durgin
  2014-06-11  4:01       ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Josh Durgin @ 2014-06-10 19:38 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, ceph-devel

On Tue, 10 Jun 2014 14:52:54 +0800
Haomai Wang <haomaiwang@gmail.com> wrote:

> Thanks, Josh!
> 
> Your points are really helpful. Maybe we can schedule this blueprint
> for the upcoming CDS? I hope the implementation can have a significant
> performance impact on librbd.

It'd be great to discuss it more at CDS. Could you add a blueprint for
it on the wiki:

https://wiki.ceph.com/Planning/Blueprints/Submissions 

Josh

> On Tue, Jun 10, 2014 at 9:16 AM, Josh Durgin
> <josh.durgin@inktank.com> wrote:
> > On 06/05/2014 12:01 AM, Haomai Wang wrote:
> >> Hi,
> >> Previously I sent a mail about the difficulty of computing rbd
> >> snapshot size statistics. The main solution is to use an object map
> >> to record changes; the problem is that we can't handle concurrent
> >> modification by multiple clients.
> >>
> >> The lack of an object map (like the pointer map in qcow2) causes
> >> many problems in librbd, such as with clone depth: a deep clone
> >> chain adds noticeable latency, and each level of cloning roughly
> >> doubles it.
> >>
> >> I propose a tradeoff between multi-client and single-client support
> >> in librbd. In practice, most volumes/images are used by VMs, where
> >> only one client accesses or modifies the image. We shouldn't make
> >> shared images possible at the cost of making the common case slow.
> >> So we can add a new flag called "shared" at image creation time. If
> >> "shared" is false, librbd will maintain an object map for each
> >> image. The object map is meant to be durable: each image_close call
> >> stores the map into RADOS. If a client crashes before it can write
> >> the object map back, the next client to open the image treats the
> >> map as out of date and resets it.
> >>
> >> The advantages of this feature are easy to see:
> >> 1. It avoids the clone performance problem.
> >> 2. It makes snapshot size statistics possible.
> >> 3. It improves the performance of librbd operations, including
> >> reads and copy-on-write.
> >>
> >> What do you think? Feedback is appreciated!
> >
> > I think it's a great idea! We discussed this a little at the last
> > cds [1]. I like the idea of the shared flag on an image. Since the
> > vastly more common case is single-client, I'd go further and
> > suggest that we treat images as if shared is false by default if
> > the flag is not present (perhaps with a config option to change
> > this default behavior).
> >
> > That way existing images can benefit from the feature without extra
> > configuration. There can be an rbd command to toggle the shared
> > flag as well, so users of ocfs2 or gfs2 or other
> > multi-client-writing systems can upgrade and set shared to true
> > before restarting their clients.
> >
> > Another thing to consider is the granularity of the object map. The
> > coarse granularity of a bitmap of object existence would be
> > simplest, and most useful for in-memory comparison for clones. For
> > statistics it might be desirable in the future to have a
> > finer-grained index of data existence in the image. To make that
> > easy to handle, the on-disk format could be a list of extents (byte
> > ranges).
> >
> > Another potential use case would be a mode in which the index is
> > treated as authoritative. This could make discard very fast, for
> > example. I'm not sure it could be done safely with only binary
> > 'exists/does not exist' information though - a third 'unknown' state
> > might be needed for some cases. If this kind of index is actually
> > useful (I'm not sure there are cases where the performance penalty
> > would be worth it), we could add a new index format if we need it.
> >
> > Back to the currently proposed design, to be safe with live
> > migration we'd need to make sure the index is consistent in the
> > destination process. Using rados_notify() after we set the clean
> > flag on the index can make the destination vm re-read the index
> > before any I/O happens. This might be a good time to introduce a
> > data payload to the notify as well, so we can only re-read the
> > index, instead of all the header metadata. Rereading the index
> > after cache invalidation and wiring that up through qemu's
> > bdrv_invalidate() would be even better.
> >
> > There's more to consider in implementing this wrt snapshots, but
> > this email has gone on long enough.
> >
> > Josh
> >
> > [1] http://pad.ceph.com/p/cdsgiant-rbd-copy-on-read-for-clones
> 
> 
> 



* Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
  2014-06-10 19:38     ` Josh Durgin
@ 2014-06-11  4:01       ` Gregory Farnum
  2014-07-14 14:34         ` Haomai Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory Farnum @ 2014-06-11  4:01 UTC (permalink / raw)
  To: Josh Durgin, Haomai Wang; +Cc: ceph-devel

We discussed a great deal of this during the initial format 2 work as
well, when we were thinking about having bitmaps of allocated space.
(Although we also have interval sets which might be a better fit?) I
think there was more thought behind it than is in the copy-on-read
blueprint; do you know if we have it written down anywhere, Josh?
-Greg
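
For readers unfamiliar with the structure Greg mentions: an interval
set stores sorted, non-overlapping ranges and merges touching
neighbours on insert, so a long written run costs one entry instead of
one bit per object. Ceph carries its own interval_set in the source
tree; the toy C++ version below only illustrates the idea:

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <map>

    struct IntervalSet {
      std::map<uint64_t, uint64_t> m;   // key = range start, value = length

      void insert(uint64_t start, uint64_t len) {
        auto it = m.lower_bound(start);
        if (it != m.begin()) {          // merge with the preceding range
          auto prev = std::prev(it);
          if (prev->first + prev->second >= start) {
            uint64_t end = std::max(prev->first + prev->second, start + len);
            start = prev->first;
            len = end - start;
            m.erase(prev);
          }
        }
        while (it != m.end() && it->first <= start + len) {
          len = std::max(start + len, it->first + it->second) - start;
          it = m.erase(it);             // swallow ranges we now cover
        }
        m[start] = len;
      }
    };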

On Tue, Jun 10, 2014 at 12:38 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> On Tue, 10 Jun 2014 14:52:54 +0800
> Haomai Wang <haomaiwang@gmail.com> wrote:
>
>> Thanks, Josh!
>>
>> Your points are really helpful. Maybe we can schedule this blueprint
>> for the upcoming CDS? I hope the implementation can have a significant
>> performance impact on librbd.
>
> It'd be great to discuss it more at CDS. Could you add a blueprint for
> it on the wiki:
>
> https://wiki.ceph.com/Planning/Blueprints/Submissions
>
> Josh
>
>> On Tue, Jun 10, 2014 at 9:16 AM, Josh Durgin
>> <josh.durgin@inktank.com> wrote:
>> > On 06/05/2014 12:01 AM, Haomai Wang wrote:
>> >> Hi,
>> >> Previously I sent a mail about the difficulty of computing rbd
>> >> snapshot size statistics. The main solution is to use an object
>> >> map to record changes; the problem is that we can't handle
>> >> concurrent modification by multiple clients.
>> >>
>> >> The lack of an object map (like the pointer map in qcow2) causes
>> >> many problems in librbd, such as with clone depth: a deep clone
>> >> chain adds noticeable latency, and each level of cloning roughly
>> >> doubles it.
>> >>
>> >> I propose a tradeoff between multi-client and single-client
>> >> support in librbd. In practice, most volumes/images are used by
>> >> VMs, where only one client accesses or modifies the image. We
>> >> shouldn't make shared images possible at the cost of making the
>> >> common case slow. So we can add a new flag called "shared" at
>> >> image creation time. If "shared" is false, librbd will maintain an
>> >> object map for each image. The object map is meant to be durable:
>> >> each image_close call stores the map into RADOS. If a client
>> >> crashes before it can write the object map back, the next client
>> >> to open the image treats the map as out of date and resets it.
>> >>
>> >> The advantages of this feature are easy to see:
>> >> 1. It avoids the clone performance problem.
>> >> 2. It makes snapshot size statistics possible.
>> >> 3. It improves the performance of librbd operations, including
>> >> reads and copy-on-write.
>> >>
>> >> What do you think? Feedback is appreciated!
>> >
>> > I think it's a great idea! We discussed this a little at the last
>> > cds [1]. I like the idea of the shared flag on an image. Since the
>> > vastly more common case is single-client, I'd go further and
>> > suggest that we treat images as if shared is false by default if
>> > the flag is not present (perhaps with a config option to change
>> > this default behavior).
>> >
>> > That way existing images can benefit from the feature without extra
>> > configuration. There can be an rbd command to toggle the shared
>> > flag as well, so users of ocfs2 or gfs2 or other
>> > multi-client-writing systems can upgrade and set shared to true
>> > before restarting their clients.
>> >
>> > Another thing to consider is the granularity of the object map. The
>> > coarse granularity of a bitmap of object existence would be
>> > simplest, and most useful for in-memory comparison for clones. For
>> > statistics it might be desirable in the future to have a
>> > finer-grained index of data existence in the image. To make that
>> > easy to handle, the on-disk format could be a list of extents (byte
>> > ranges).
>> >
>> > Another potential use case would be a mode in which the index is
>> > treated as authoritative. This could make discard very fast, for
>> > example. I'm not sure it could be done safely with only binary
>> > 'exists/does not exist' information though - a third 'unknown' state
>> > might be needed for some cases. If this kind of index is actually
>> > useful (I'm not sure there are cases where the performance penalty
>> > would be worth it), we could add a new index format if we need it.
>> >
>> > Back to the currently proposed design, to be safe with live
>> > migration we'd need to make sure the index is consistent in the
>> > destination process. Using rados_notify() after we set the clean
>> > flag on the index can make the destination vm re-read the index
>> > before any I/O happens. This might be a good time to introduce a
>> > data payload to the notify as well, so we can only re-read the
>> > index, instead of all the header metadata. Rereading the index
>> > after cache invalidation and wiring that up through qemu's
>> > bdrv_invalidate() would be even better.
>> >
>> > There's more to consider in implementing this wrt snapshots, but
>> > this email has gone on long enough.
>> >
>> > Josh
>> >
>> > [1] http://pad.ceph.com/p/cdsgiant-rbd-copy-on-read-for-clones
>>
>>
>>
>


* Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
  2014-06-11  4:01       ` Gregory Farnum
@ 2014-07-14 14:34         ` Haomai Wang
  0 siblings, 0 replies; 10+ messages in thread
From: Haomai Wang @ 2014-07-14 14:34 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Josh Durgin, ceph-devel

Hi all,

I have watched the discussion video from the Ceph CDS. By the way,
sorry for my absence -- something urgent came up.

It seems we have two ways to implement this: one lightweight, the
other more complex. I prefer the simple one, which invalidates the
cache and lets librbd reload or lazily load object state. The most
important piece is implementing a performance-optimized index
(ObjectMap).

Is there any progress, Josh? I think we can push further based on that
discussion. Or did I miss something?



On Wed, Jun 11, 2014 at 12:01 PM, Gregory Farnum <greg@inktank.com> wrote:
> We discussed a great deal of this during the initial format 2 work as
> well, when we were thinking about having bitmaps of allocated space.
> (Although we also have interval sets which might be a better fit?) I
> think there was more thought behind it than is in the copy-on-read
> blueprint; do you know if we have it written down anywhere, Josh?
> -Greg
>
> On Tue, Jun 10, 2014 at 12:38 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>> On Tue, 10 Jun 2014 14:52:54 +0800
>> Haomai Wang <haomaiwang@gmail.com> wrote:
>>
>>> Thanks, Josh!
>>>
>>> Your points are really helpful. Maybe we can schedule this blueprint
>>> for the upcoming CDS? I hope the implementation can have a significant
>>> performance impact on librbd.
>>
>> It'd be great to discuss it more at CDS. Could you add a blueprint for
>> it on the wiki:
>>
>> https://wiki.ceph.com/Planning/Blueprints/Submissions
>>
>> Josh
>>
>>> On Tue, Jun 10, 2014 at 9:16 AM, Josh Durgin
>>> <josh.durgin@inktank.com> wrote:
>>> > On 06/05/2014 12:01 AM, Haomai Wang wrote:
>>> >> Hi,
>>> >> Previously I sent a mail about the difficulty of computing rbd
>>> >> snapshot size statistics. The main solution is to use an object
>>> >> map to record changes; the problem is that we can't handle
>>> >> concurrent modification by multiple clients.
>>> >>
>>> >> The lack of an object map (like the pointer map in qcow2) causes
>>> >> many problems in librbd, such as with clone depth: a deep clone
>>> >> chain adds noticeable latency, and each level of cloning roughly
>>> >> doubles it.
>>> >>
>>> >> I propose a tradeoff between multi-client and single-client
>>> >> support in librbd. In practice, most volumes/images are used by
>>> >> VMs, where only one client accesses or modifies the image. We
>>> >> shouldn't make shared images possible at the cost of making the
>>> >> common case slow. So we can add a new flag called "shared" at
>>> >> image creation time. If "shared" is false, librbd will maintain
>>> >> an object map for each image. The object map is meant to be
>>> >> durable: each image_close call stores the map into RADOS. If a
>>> >> client crashes before it can write the object map back, the next
>>> >> client to open the image treats the map as out of date and
>>> >> resets it.
>>> >>
>>> >> The advantages of this feature are easy to see:
>>> >> 1. It avoids the clone performance problem.
>>> >> 2. It makes snapshot size statistics possible.
>>> >> 3. It improves the performance of librbd operations, including
>>> >> reads and copy-on-write.
>>> >>
>>> >> What do you think? Feedback is appreciated!
>>> >
>>> > I think it's a great idea! We discussed this a little at the last
>>> > cds [1]. I like the idea of the shared flag on an image. Since the
>>> > vastly more common case is single-client, I'd go further and
>>> > suggest that we treat images as if shared is false by default if
>>> > the flag is not present (perhaps with a config option to change
>>> > this default behavior).
>>> >
>>> > That way existing images can benefit from the feature without extra
>>> > configuration. There can be an rbd command to toggle the shared
>>> > flag as well, so users of ocfs2 or gfs2 or other
>>> > multi-client-writing systems can upgrade and set shared to true
>>> > before restarting their clients.
>>> >
>>> > Another thing to consider is the granularity of the object map. The
>>> > coarse granularity of a bitmap of object existence would be
>>> > simplest, and most useful for in-memory comparison for clones. For
>>> > statistics it might be desirable in the future to have a
>>> > finer-grained index of data existence in the image. To make that
>>> > easy to handle, the on-disk format could be a list of extents (byte
>>> > ranges).
>>> >
>>> > Another potential use case would be a mode in which the index is
>>> > treated as authoritative. This could make discard very fast, for
>>> > example. I'm not sure it could be done safely with only binary
>>> > 'exists/does not exist' information though - a third 'unknown' state
>>> > might be needed for some cases. If this kind of index is actually
>>> > useful (I'm not sure there are cases where the performance penalty
>>> > would be worth it), we could add a new index format if we need it.
>>> >
>>> > Back to the currently proposed design, to be safe with live
>>> > migration we'd need to make sure the index is consistent in the
>>> > destination process. Using rados_notify() after we set the clean
>>> > flag on the index can make the destination vm re-read the index
>>> > before any I/O happens. This might be a good time to introduce a
>>> > data payload to the notify as well, so we can only re-read the
>>> > index, instead of all the header metadata. Rereading the index
>>> > after cache invalidation and wiring that up through qemu's
>>> > bdrv_invalidate() would be even better.
>>> >
>>> > There's more to consider in implementing this wrt snapshots, but
>>> > this email has gone on long enough.
>>> >
>>> > Josh
>>> >
>>> > [1] http://pad.ceph.com/p/cdsgiant-rbd-copy-on-read-for-clones
>>>
>>>
>>>
>>



-- 
Best Regards,

Wheat


end of thread

Thread overview: 10+ messages
2014-06-05  7:01 [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose Haomai Wang
2014-06-05  7:25 ` Wido den Hollander
2014-06-05  7:43   ` Haomai Wang
2014-06-05 13:55     ` Allen Samuels
2014-06-05 14:40       ` Haomai Wang
2014-06-10  1:16 ` Josh Durgin
2014-06-10  6:52   ` Haomai Wang
2014-06-10 19:38     ` Josh Durgin
2014-06-11  4:01       ` Gregory Farnum
2014-07-14 14:34         ` Haomai Wang
