* Re: Why does Erasure-pool not support omap?
       [not found] <201710251652060421729@zte.com.cn>
@ 2017-10-25 12:16 ` Sage Weil
  2017-10-25 18:57   ` Josh Durgin
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2017-10-25 12:16 UTC (permalink / raw)
  To: xie.xingguo; +Cc: ceph-devel, jdurgin, gfarnum


Hi Xingguo,

On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>       I wonder why erasure-pools can not support omap currently.
> 
>       The simplest way for erasure-pools to support omap I can figure out would be duplicating omap on every shard.
> 
>       It is because it consumes too much space when k + m gets bigger?

Right.  There isn't a nontrivial way to actually erasure code it, and 
duplicating on every shard is inefficient.

One reasonableish approach would be to replicate the omap data on m+1 
shards.  But it's a bit of work to implement and nobody has done it.
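
For a sense of the space difference, here is a back-of-the-envelope 
sketch (not Ceph code; the k=8, m=3 profile is just an example):

// Illustration only: omap copy counts under the two strategies, for an
// arbitrary example profile of k=8 data + m=3 coding shards.
#include <iostream>

int main() {
  const unsigned k = 8, m = 3;
  std::cout << "duplicate omap on every shard: " << (k + m) << " copies\n"
            << "replicate omap on m+1 shards:  " << (m + 1) << " copies\n";
  // Both schemes tolerate the loss of any m OSDs: with m+1 copies, losing
  // any m shards still leaves at least one shard holding the omap data.
  return 0;
}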

I can't remember whether there were concerns with this approach or it 
was just a matter of time/resources... Josh? Greg?

sage


* Re: Why does Erasure-pool not support omap?
  2017-10-25 12:16 ` Why does Erasure-pool not support omap? Sage Weil
@ 2017-10-25 18:57   ` Josh Durgin
  2017-10-26 14:20     ` Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread
From: Josh Durgin @ 2017-10-25 18:57 UTC (permalink / raw)
  To: Sage Weil, xie.xingguo; +Cc: ceph-devel, gfarnum

On 10/25/2017 05:16 AM, Sage Weil wrote:
> Hi Xingguo,
> 
> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>>        I wonder why erasure-pools can not support omap currently.
>>
>>        The simplest way for erasure-pools to support omap I can figure out would be duplicating omap on every shard.
>>
>>        It is because it consumes too much space when k + m gets bigger?
> 
> Right.  There isn't a nontrivial way to actually erasure code it, and
> duplicating on every shard is inefficient.
> 
> One reasonableish approach would be to replicate the omap data on m+1
> shards.  But it's a bit of work to implement and nobody has done it.
> 
> I can't remember if there were concerns with this approach or it was just
> a matter of time/resources... Josh? Greg?

It restricts us to erasure codes like Reed-Solomon, where a subset of 
shards is always updated. I think this is a reasonable trade-off, 
though; it's just a matter of implementing it. We haven't written up 
the required peering changes, but they did not seem too difficult to 
implement.

Some notes on the approach are here - just think of 'replicating omap'
as a partial write to m+1 shards:

http://pad.ceph.com/p/ec-partial-writes


* Re: Why does Erasure-pool not support omap?
  2017-10-25 18:57   ` Josh Durgin
@ 2017-10-26 14:20     ` Gregory Farnum
  2017-10-26 14:26       ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2017-10-26 14:20 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Sage Weil, Xie Xingguo, ceph-devel

On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
> On 10/25/2017 05:16 AM, Sage Weil wrote:
>>
>> Hi Xingguo,
>>
>> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>>>
>>>        I wonder why erasure-pools can not support omap currently.
>>>
>>>        The simplest way for erasure-pools to support omap I can figure
>>> out would be duplicating omap on every shard.
>>>
>>>        It is because it consumes too much space when k + m gets bigger?
>>
>>
>> Right.  There isn't a nontrivial way to actually erasure code it, and
>> duplicating on every shard is inefficient.
>>
>> One reasonableish approach would be to replicate the omap data on m+1
>> shards.  But it's a bit of work to implement and nobody has done it.
>>
>> I can't remember if there were concerns with this approach or it was just
>> a matter of time/resources... Josh? Greg?
>
>
> It restricts us to erasure codes like reed-solomon where a subset of shards
> are always updated. I think this is a reasonable trade-off though, it's just
> a matter of implementing it. We haven't written
> up the required peering changes, but they did not seem too difficult to
> implement.
>
> Some notes on the approach are here - just think of 'replicating omap'
> as a partial write to m+1 shards:
>
> http://pad.ceph.com/p/ec-partial-writes

Yeah. To expand a bit on why this only works for Reed-Solomon,
consider the minimum and appropriate number of copies — and the actual
shard placement — for local recovery codes. :/ We were unable to
generalize for that (or indeed for SHEC, IIRC) when whiteboarding.

I'm also still nervous that this might do weird things to our recovery
and availability patterns in more complex failure cases, but I don't
have any concrete issues.
-Greg


* Re: Why does Erasure-pool not support omap?
  2017-10-26 14:20     ` Gregory Farnum
@ 2017-10-26 14:26       ` Sage Weil
  2017-10-26 15:07         ` Matt Benjamin
                           ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Sage Weil @ 2017-10-26 14:26 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Josh Durgin, Xie Xingguo, ceph-devel


On Thu, 26 Oct 2017, Gregory Farnum wrote:
> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
> > On 10/25/2017 05:16 AM, Sage Weil wrote:
> >>
> >> Hi Xingguo,
> >>
> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
> >>>
> >>>        I wonder why erasure-pools can not support omap currently.
> >>>
> >>>        The simplest way for erasure-pools to support omap I can figure
> >>> out would be duplicating omap on every shard.
> >>>
> >>>        It is because it consumes too much space when k + m gets bigger?
> >>
> >>
> >> Right.  There isn't a nontrivial way to actually erasure code it, and
> >> duplicating on every shard is inefficient.
> >>
> >> One reasonableish approach would be to replicate the omap data on m+1
> >> shards.  But it's a bit of work to implement and nobody has done it.
> >>
> >> I can't remember if there were concerns with this approach or it was just
> >> a matter of time/resources... Josh? Greg?
> >
> >
> > It restricts us to erasure codes like reed-solomon where a subset of shards
> > are always updated. I think this is a reasonable trade-off though, it's just
> > a matter of implementing it. We haven't written
> > up the required peering changes, but they did not seem too difficult to
> > implement.
> >
> > Some notes on the approach are here - just think of 'replicating omap'
> > as a partial write to m+1 shards:
> >
> > http://pad.ceph.com/p/ec-partial-writes
> 
> Yeah. To expand a bit on why this only works for Reed-Solomon,
> consider the minimum and appropriate number of copies — and the actual
> shard placement — for local recovery codes. :/ We were unable to
> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
> 
> I'm also still nervous that this might do weird things to our recovery
> and availability patterns in more complex failure cases, but I don't
> have any concrete issues.

It seems like the minimum-viable variation of this is that we don't change 
any of the peering or logging behavior at all: we send the omap writes to 
all shards (like any other write), but only the anointed shards persist 
them.
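
Roughly, the shard-local behavior would be something like this sketch 
(not OSD code; which shards count as "anointed" is left abstract here, 
since we haven't pinned that down):

// Sketch of the idea only, not OSD code.  Every shard receives the op and
// logs it, so peering and logging are unchanged; whether the omap payload
// is persisted is a purely local decision.
#include <set>

struct OmapPayload { /* key/value updates, elided */ };

void apply_write_on_shard(int shard_id,
                          const std::set<int>& anointed_shards,
                          const OmapPayload& omap) {
  // attrs, the EC data chunk and the pg log entry are applied on every
  // shard exactly as today
  if (anointed_shards.count(shard_id)) {
    // persist `omap` into this shard's local key/value store
    (void)omap;
  }
  // on all other shards the omap payload is simply dropped
}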

That leaves lots of room for improvement, but it makes the feature work 
without many changes, and means we can drop the specialness around rbd 
images in EC pools.

Then we can make CephFS and RGW warn about (or even refuse) using EC pools 
for their metadata or index pools, since that is strictly less efficient 
than a replicated pool, to avoid user mistakes.

?

sage


* Re: Why does Erasure-pool not support omap?
  2017-10-26 14:26       ` Sage Weil
@ 2017-10-26 15:07         ` Matt Benjamin
  2017-10-26 15:08         ` Jason Dillaman
  2017-10-26 16:21         ` Josh Durgin
  2 siblings, 0 replies; 15+ messages in thread
From: Matt Benjamin @ 2017-10-26 15:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel

That sounds like a promising way forward, to me.

Matt

On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>> >>
>> >> Hi Xingguo,
>> >>
>> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>> >>>
>> >>>        I wonder why erasure-pools can not support omap currently.
>> >>>
>> >>>        The simplest way for erasure-pools to support omap I can figure
>> >>> out would be duplicating omap on every shard.
>> >>>
>> >>>        It is because it consumes too much space when k + m gets bigger?
>> >>
>> >>
>> >> Right.  There isn't a nontrivial way to actually erasure code it, and
>> >> duplicating on every shard is inefficient.
>> >>
>> >> One reasonableish approach would be to replicate the omap data on m+1
>> >> shards.  But it's a bit of work to implement and nobody has done it.
>> >>
>> >> I can't remember if there were concerns with this approach or it was just
>> >> a matter of time/resources... Josh? Greg?
>> >
>> >
>> > It restricts us to erasure codes like reed-solomon where a subset of shards
>> > are always updated. I think this is a reasonable trade-off though, it's just
>> > a matter of implementing it. We haven't written
>> > up the required peering changes, but they did not seem too difficult to
>> > implement.
>> >
>> > Some notes on the approach are here - just think of 'replicating omap'
>> > as a partial write to m+1 shards:
>> >
>> > http://pad.ceph.com/p/ec-partial-writes
>>
>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>> consider the minimum and appropriate number of copies — and the actual
>> shard placement — for local recovery codes. :/ We were unable to
>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>
>> I'm also still nervous that this might do weird things to our recovery
>> and availability patterns in more complex failure cases, but I don't
>> have any concrete issues.
>
> It seems like the minimum-viable variation of this is that we don't change
> any of the peering or logging behavior at all, but just send the omap
> writes to all shards (like any other write), but only the annointed shards
> persist.
>
> That leaves lots of room for improvement, but it makes the feature work
> without many changes, and means we can drop the specialness around rbd
> images in EC pools.
>
> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC
> pools for their metadata or index pools since it's strictly less efficient
> than replicated to avoid user mistakes.
>
> ?
>
> sage



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


* Re: Why does Erasure-pool not support omap?
  2017-10-26 14:26       ` Sage Weil
  2017-10-26 15:07         ` Matt Benjamin
@ 2017-10-26 15:08         ` Jason Dillaman
  2017-10-26 15:35           ` Matt Benjamin
  2017-10-26 16:10           ` Sage Weil
  2017-10-26 16:21         ` Josh Durgin
  2 siblings, 2 replies; 15+ messages in thread
From: Jason Dillaman @ 2017-10-26 15:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel

On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>> >>
>> >> Hi Xingguo,
>> >>
>> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>> >>>
>> >>>        I wonder why erasure-pools can not support omap currently.
>> >>>
>> >>>        The simplest way for erasure-pools to support omap I can figure
>> >>> out would be duplicating omap on every shard.
>> >>>
>> >>>        It is because it consumes too much space when k + m gets bigger?
>> >>
>> >>
>> >> Right.  There isn't a nontrivial way to actually erasure code it, and
>> >> duplicating on every shard is inefficient.
>> >>
>> >> One reasonableish approach would be to replicate the omap data on m+1
>> >> shards.  But it's a bit of work to implement and nobody has done it.
>> >>
>> >> I can't remember if there were concerns with this approach or it was just
>> >> a matter of time/resources... Josh? Greg?
>> >
>> >
>> > It restricts us to erasure codes like reed-solomon where a subset of shards
>> > are always updated. I think this is a reasonable trade-off though, it's just
>> > a matter of implementing it. We haven't written
>> > up the required peering changes, but they did not seem too difficult to
>> > implement.
>> >
>> > Some notes on the approach are here - just think of 'replicating omap'
>> > as a partial write to m+1 shards:
>> >
>> > http://pad.ceph.com/p/ec-partial-writes
>>
>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>> consider the minimum and appropriate number of copies — and the actual
>> shard placement — for local recovery codes. :/ We were unable to
>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>
>> I'm also still nervous that this might do weird things to our recovery
>> and availability patterns in more complex failure cases, but I don't
>> have any concrete issues.
>
> It seems like the minimum-viable variation of this is that we don't change
> any of the peering or logging behavior at all, but just send the omap
> writes to all shards (like any other write), but only the annointed shards
> persist.
>
> That leaves lots of room for improvement, but it makes the feature work
> without many changes, and means we can drop the specialness around rbd
> images in EC pools.

Potentially a negative, since RBD relies heavily on class methods.
Even assuming the cls_cxx_map_XYZ operations will never require async
work, there is still the issue of methods that perform straight
read/write calls.
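
To illustrate the concern with a toy model (this is not the objclass 
API; the classification is just my reading of the proposal, assuming 
the primary always holds an omap copy):

// Toy model only, not Ceph's objclass API.  Under the "replicate omap on
// a few shards" proposal, omap and xattr calls made by a class method can
// be served from the primary's local copy, but a straight object-data
// read touches the erasure-coded payload, which the primary alone cannot
// reconstruct, so it would need async handling.
#include <iostream>

enum class ClsOp { OmapRead, OmapWrite, XattrRead, DataRead, DataWrite };

bool needs_async_on_ec(ClsOp op) {
  switch (op) {
    case ClsOp::OmapRead:
    case ClsOp::OmapWrite:
    case ClsOp::XattrRead:
      return false;               // served locally on the primary
    case ClsOp::DataRead:
    case ClsOp::DataWrite:
      return true;                // needs the other shards
  }
  return true;
}

int main() {
  std::cout << std::boolalpha
            << "cls data read needs async handling: "
            << needs_async_on_ec(ClsOp::DataRead) << "\n";
}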

> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC
> pools for their metadata or index pools since it's strictly less efficient
> than replicated to avoid user mistakes.
>
> ?
>
> sage



-- 
Jason


* Re: Why does Erasure-pool not support omap?
  2017-10-26 15:08         ` Jason Dillaman
@ 2017-10-26 15:35           ` Matt Benjamin
  2017-10-26 15:49             ` Jason Dillaman
  2017-10-26 16:10           ` Sage Weil
  1 sibling, 1 reply; 15+ messages in thread
From: Matt Benjamin @ 2017-10-26 15:35 UTC (permalink / raw)
  To: Dillaman, Jason
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel

I had the understanding that RGW's use of class methods, which is also
extensive, would be compatible with this approach.  Is there reason to
doubt that?

Matt

On Thu, Oct 26, 2017 at 11:08 AM, Jason Dillaman <jdillama@redhat.com> wrote:
> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote:
>> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>>> >>
>>> >> Hi Xingguo,
>>> >>
>>> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>>> >>>
>>> >>>        I wonder why erasure-pools can not support omap currently.
>>> >>>
>>> >>>        The simplest way for erasure-pools to support omap I can figure
>>> >>> out would be duplicating omap on every shard.
>>> >>>
>>> >>>        It is because it consumes too much space when k + m gets bigger?
>>> >>
>>> >>
>>> >> Right.  There isn't a nontrivial way to actually erasure code it, and
>>> >> duplicating on every shard is inefficient.
>>> >>
>>> >> One reasonableish approach would be to replicate the omap data on m+1
>>> >> shards.  But it's a bit of work to implement and nobody has done it.
>>> >>
>>> >> I can't remember if there were concerns with this approach or it was just
>>> >> a matter of time/resources... Josh? Greg?
>>> >
>>> >
>>> > It restricts us to erasure codes like reed-solomon where a subset of shards
>>> > are always updated. I think this is a reasonable trade-off though, it's just
>>> > a matter of implementing it. We haven't written
>>> > up the required peering changes, but they did not seem too difficult to
>>> > implement.
>>> >
>>> > Some notes on the approach are here - just think of 'replicating omap'
>>> > as a partial write to m+1 shards:
>>> >
>>> > http://pad.ceph.com/p/ec-partial-writes
>>>
>>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>>> consider the minimum and appropriate number of copies — and the actual
>>> shard placement — for local recovery codes. :/ We were unable to
>>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>>
>>> I'm also still nervous that this might do weird things to our recovery
>>> and availability patterns in more complex failure cases, but I don't
>>> have any concrete issues.
>>
>> It seems like the minimum-viable variation of this is that we don't change
>> any of the peering or logging behavior at all, but just send the omap
>> writes to all shards (like any other write), but only the annointed shards
>> persist.
>>
>> That leaves lots of room for improvement, but it makes the feature work
>> without many changes, and means we can drop the specialness around rbd
>> images in EC pools.
>
> Potentially negative since RBD relies heavily on class methods.
> Assuming the cls_cxx_map_XYZ operations will never require async work,
> there is still the issue with methods that perform straight read/write
> calls.
>
>> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC
>> pools for their metadata or index pools since it's strictly less efficient
>> than replicated to avoid user mistakes.
>>
>> ?
>>
>> sage
>
>
>
> --
> Jason



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


* Re: Why does Erasure-pool not support omap?
  2017-10-26 15:35           ` Matt Benjamin
@ 2017-10-26 15:49             ` Jason Dillaman
  2017-10-26 15:50               ` Matt Benjamin
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Dillaman @ 2017-10-26 15:49 UTC (permalink / raw)
  To: Matt Benjamin
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel

On Thu, Oct 26, 2017 at 11:35 AM, Matt Benjamin <mbenjami@redhat.com> wrote:
> I had the understanding that RGW's use of class methods, which is also
> extensive, would be compatible with this approach.  Is there reason to
> doubt that?

I don't see any "cls_cxx_read" calls in RGW's class methods. Like I
said, assuming the object-class omap calls remain synchronous on an
EC-backed pool, omap won't be an issue for RBD, but data reads will be
(the v1 directory, RBD image id objects, and the object map).

> Matt
>
> On Thu, Oct 26, 2017 at 11:08 AM, Jason Dillaman <jdillama@redhat.com> wrote:
>> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote:
>>> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>>>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>>>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>>>> >>
>>>> >> Hi Xingguo,
>>>> >>
>>>> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>>>> >>>
>>>> >>>        I wonder why erasure-pools can not support omap currently.
>>>> >>>
>>>> >>>        The simplest way for erasure-pools to support omap I can figure
>>>> >>> out would be duplicating omap on every shard.
>>>> >>>
>>>> >>>        It is because it consumes too much space when k + m gets bigger?
>>>> >>
>>>> >>
>>>> >> Right.  There isn't a nontrivial way to actually erasure code it, and
>>>> >> duplicating on every shard is inefficient.
>>>> >>
>>>> >> One reasonableish approach would be to replicate the omap data on m+1
>>>> >> shards.  But it's a bit of work to implement and nobody has done it.
>>>> >>
>>>> >> I can't remember if there were concerns with this approach or it was just
>>>> >> a matter of time/resources... Josh? Greg?
>>>> >
>>>> >
>>>> > It restricts us to erasure codes like reed-solomon where a subset of shards
>>>> > are always updated. I think this is a reasonable trade-off though, it's just
>>>> > a matter of implementing it. We haven't written
>>>> > up the required peering changes, but they did not seem too difficult to
>>>> > implement.
>>>> >
>>>> > Some notes on the approach are here - just think of 'replicating omap'
>>>> > as a partial write to m+1 shards:
>>>> >
>>>> > http://pad.ceph.com/p/ec-partial-writes
>>>>
>>>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>>>> consider the minimum and appropriate number of copies — and the actual
>>>> shard placement — for local recovery codes. :/ We were unable to
>>>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>>>
>>>> I'm also still nervous that this might do weird things to our recovery
>>>> and availability patterns in more complex failure cases, but I don't
>>>> have any concrete issues.
>>>
>>> It seems like the minimum-viable variation of this is that we don't change
>>> any of the peering or logging behavior at all, but just send the omap
>>> writes to all shards (like any other write), but only the annointed shards
>>> persist.
>>>
>>> That leaves lots of room for improvement, but it makes the feature work
>>> without many changes, and means we can drop the specialness around rbd
>>> images in EC pools.
>>
>> Potentially negative since RBD relies heavily on class methods.
>> Assuming the cls_cxx_map_XYZ operations will never require async work,
>> there is still the issue with methods that perform straight read/write
>> calls.
>>
>>> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC
>>> pools for their metadata or index pools since it's strictly less efficient
>>> than replicated to avoid user mistakes.
>>>
>>> ?
>>>
>>> sage
>>
>>
>>
>> --
>> Jason
>
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309



-- 
Jason


* Re: Why does Erasure-pool not support omap?
  2017-10-26 15:49             ` Jason Dillaman
@ 2017-10-26 15:50               ` Matt Benjamin
  0 siblings, 0 replies; 15+ messages in thread
From: Matt Benjamin @ 2017-10-26 15:50 UTC (permalink / raw)
  To: Dillaman, Jason
  Cc: Sage Weil, Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel

thanks for the explanation, Jason

Matt

On Thu, Oct 26, 2017 at 11:49 AM, Jason Dillaman <jdillama@redhat.com> wrote:
> On Thu, Oct 26, 2017 at 11:35 AM, Matt Benjamin <mbenjami@redhat.com> wrote:
>> I had the understanding that RGW's use of class methods, which is also
>> extensive, would be compatible with this approach.  Is there reason to
>> doubt that?
>
> I don't see any "cls_cxx_read" calls in RGW's class methods. Like I
> said, assuming the omap class object calls remain synchronous on an
> EC-backed pool, omap won't be an issue for RBD but reads will be an
> issue (v1 directory, RBD image id objects, and the object map).
>
>> Matt
>>
>> On Thu, Oct 26, 2017 at 11:08 AM, Jason Dillaman <jdillama@redhat.com> wrote:
>>> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote:
>>>> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>>>>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>>>>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>>>>> >>
>>>>> >> Hi Xingguo,
>>>>> >>
>>>>> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>>>>> >>>
>>>>> >>>        I wonder why erasure-pools can not support omap currently.
>>>>> >>>
>>>>> >>>        The simplest way for erasure-pools to support omap I can figure
>>>>> >>> out would be duplicating omap on every shard.
>>>>> >>>
>>>>> >>>        It is because it consumes too much space when k + m gets bigger?
>>>>> >>
>>>>> >>
>>>>> >> Right.  There isn't a nontrivial way to actually erasure code it, and
>>>>> >> duplicating on every shard is inefficient.
>>>>> >>
>>>>> >> One reasonableish approach would be to replicate the omap data on m+1
>>>>> >> shards.  But it's a bit of work to implement and nobody has done it.
>>>>> >>
>>>>> >> I can't remember if there were concerns with this approach or it was just
>>>>> >> a matter of time/resources... Josh? Greg?
>>>>> >
>>>>> >
>>>>> > It restricts us to erasure codes like reed-solomon where a subset of shards
>>>>> > are always updated. I think this is a reasonable trade-off though, it's just
>>>>> > a matter of implementing it. We haven't written
>>>>> > up the required peering changes, but they did not seem too difficult to
>>>>> > implement.
>>>>> >
>>>>> > Some notes on the approach are here - just think of 'replicating omap'
>>>>> > as a partial write to m+1 shards:
>>>>> >
>>>>> > http://pad.ceph.com/p/ec-partial-writes
>>>>>
>>>>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>>>>> consider the minimum and appropriate number of copies — and the actual
>>>>> shard placement — for local recovery codes. :/ We were unable to
>>>>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>>>>
>>>>> I'm also still nervous that this might do weird things to our recovery
>>>>> and availability patterns in more complex failure cases, but I don't
>>>>> have any concrete issues.
>>>>
>>>> It seems like the minimum-viable variation of this is that we don't change
>>>> any of the peering or logging behavior at all, but just send the omap
>>>> writes to all shards (like any other write), but only the annointed shards
>>>> persist.
>>>>
>>>> That leaves lots of room for improvement, but it makes the feature work
>>>> without many changes, and means we can drop the specialness around rbd
>>>> images in EC pools.
>>>
>>> Potentially negative since RBD relies heavily on class methods.
>>> Assuming the cls_cxx_map_XYZ operations will never require async work,
>>> there is still the issue with methods that perform straight read/write
>>> calls.
>>>
>>>> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC
>>>> pools for their metadata or index pools since it's strictly less efficient
>>>> than replicated to avoid user mistakes.
>>>>
>>>> ?
>>>>
>>>> sage
>>>
>>>
>>>
>>> --
>>> Jason
>>
>>
>>
>> --
>>
>> Matt Benjamin
>> Red Hat, Inc.
>> 315 West Huron Street, Suite 140A
>> Ann Arbor, Michigan 48103
>>
>> http://www.redhat.com/en/technologies/storage
>>
>> tel.  734-821-5101
>> fax.  734-769-8938
>> cel.  734-216-5309
>
>
>
> --
> Jason



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


* Re: Why does Erasure-pool not support omap?
  2017-10-26 15:08         ` Jason Dillaman
  2017-10-26 15:35           ` Matt Benjamin
@ 2017-10-26 16:10           ` Sage Weil
  1 sibling, 0 replies; 15+ messages in thread
From: Sage Weil @ 2017-10-26 16:10 UTC (permalink / raw)
  To: dillaman; +Cc: Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel


On Thu, 26 Oct 2017, Jason Dillaman wrote:
> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
> >> > On 10/25/2017 05:16 AM, Sage Weil wrote:
> >> >>
> >> >> Hi Xingguo,
> >> >>
> >> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
> >> >>>
> >> >>>        I wonder why erasure-pools can not support omap currently.
> >> >>>
> >> >>>        The simplest way for erasure-pools to support omap I can figure
> >> >>> out would be duplicating omap on every shard.
> >> >>>
> >> >>>        It is because it consumes too much space when k + m gets bigger?
> >> >>
> >> >>
> >> >> Right.  There isn't a nontrivial way to actually erasure code it, and
> >> >> duplicating on every shard is inefficient.
> >> >>
> >> >> One reasonableish approach would be to replicate the omap data on m+1
> >> >> shards.  But it's a bit of work to implement and nobody has done it.
> >> >>
> >> >> I can't remember if there were concerns with this approach or it was just
> >> >> a matter of time/resources... Josh? Greg?
> >> >
> >> >
> >> > It restricts us to erasure codes like reed-solomon where a subset of shards
> >> > are always updated. I think this is a reasonable trade-off though, it's just
> >> > a matter of implementing it. We haven't written
> >> > up the required peering changes, but they did not seem too difficult to
> >> > implement.
> >> >
> >> > Some notes on the approach are here - just think of 'replicating omap'
> >> > as a partial write to m+1 shards:
> >> >
> >> > http://pad.ceph.com/p/ec-partial-writes
> >>
> >> Yeah. To expand a bit on why this only works for Reed-Solomon,
> >> consider the minimum and appropriate number of copies — and the actual
> >> shard placement — for local recovery codes. :/ We were unable to
> >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
> >>
> >> I'm also still nervous that this might do weird things to our recovery
> >> and availability patterns in more complex failure cases, but I don't
> >> have any concrete issues.
> >
> > It seems like the minimum-viable variation of this is that we don't change
> > any of the peering or logging behavior at all, but just send the omap
> > writes to all shards (like any other write), but only the annointed shards
> > persist.
> >
> > That leaves lots of room for improvement, but it makes the feature work
> > without many changes, and means we can drop the specialness around rbd
> > images in EC pools.
> 
> Potentially negative since RBD relies heavily on class methods.
> Assuming the cls_cxx_map_XYZ operations will never require async work,
> there is still the issue with methods that perform straight read/write
> calls.

Ooooh right, I remember now.  This can be avoided most of the time by 
making the +1 of k+1 be the (normal) primary, but in a degraded situation 
the primary might be a shard that doesn't have a copy of the omap at all, 
in which case a simple omap read would need to be async.

...and perhaps this too can be avoided by making the primary role always 
be one of the k+1 shards that have a copy of the omap data.
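
Something like this sketch of the selection rule (not PG::choose_acting(); 
the "shards 0..k hold the omap copy" convention is an assumption taken 
from the description above):

// Sketch only: prefer a primary that holds a copy of the omap data, so
// cls operations and omap reads can be served locally.  Assumes the omap
// copies live on shards 0..k, i.e. k+1 of them.
#include <vector>

int pick_primary(const std::vector<int>& acting_shards, unsigned k) {
  for (int s : acting_shards) {        // candidates in preference order
    if (s >= 0 && static_cast<unsigned>(s) <= k)
      return s;                        // this shard has the omap copy
  }
  return -1;  // degraded: no omap-bearing shard available, so a simple
              // omap read would have to go async after all
}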

I think cls_rgw is in the same boat as cls_rbd, in that it is attr and 
omap only and doesn't use the object data payload of index objects?

sage


* Re: Why does Erasure-pool not support omap?
  2017-10-26 14:26       ` Sage Weil
  2017-10-26 15:07         ` Matt Benjamin
  2017-10-26 15:08         ` Jason Dillaman
@ 2017-10-26 16:21         ` Josh Durgin
  2017-10-26 17:32           ` Matt Benjamin
  2017-10-30 22:48           ` Gregory Farnum
  2 siblings, 2 replies; 15+ messages in thread
From: Josh Durgin @ 2017-10-26 16:21 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: Xie Xingguo, ceph-devel

On 10/26/2017 07:26 AM, Sage Weil wrote:
> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>>> On 10/25/2017 05:16 AM, Sage Weil wrote:
>>>>
>>>> Hi Xingguo,
>>>>
>>>> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>>>>>
>>>>>         I wonder why erasure-pools can not support omap currently.
>>>>>
>>>>>         The simplest way for erasure-pools to support omap I can figure
>>>>> out would be duplicating omap on every shard.
>>>>>
>>>>>         It is because it consumes too much space when k + m gets bigger?
>>>>
>>>>
>>>> Right.  There isn't a nontrivial way to actually erasure code it, and
>>>> duplicating on every shard is inefficient.
>>>>
>>>> One reasonableish approach would be to replicate the omap data on m+1
>>>> shards.  But it's a bit of work to implement and nobody has done it.
>>>>
>>>> I can't remember if there were concerns with this approach or it was just
>>>> a matter of time/resources... Josh? Greg?
>>>
>>>
>>> It restricts us to erasure codes like reed-solomon where a subset of shards
>>> are always updated. I think this is a reasonable trade-off though, it's just
>>> a matter of implementing it. We haven't written
>>> up the required peering changes, but they did not seem too difficult to
>>> implement.
>>>
>>> Some notes on the approach are here - just think of 'replicating omap'
>>> as a partial write to m+1 shards:
>>>
>>> http://pad.ceph.com/p/ec-partial-writes
>>
>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>> consider the minimum and appropriate number of copies — and the actual
>> shard placement — for local recovery codes. :/ We were unable to
>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>
>> I'm also still nervous that this might do weird things to our recovery
>> and availability patterns in more complex failure cases, but I don't
>> have any concrete issues.
> 
> It seems like the minimum-viable variation of this is that we don't change
> any of the peering or logging behavior at all, but just send the omap
> writes to all shards (like any other write), but only the annointed shards
> persist.
> 
> That leaves lots of room for improvement, but it makes the feature work
> without many changes, and means we can drop the specialness around rbd
> images in EC pools.

Won't that still require recovery and read path changes?

> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC
> pools for their metadata or index pools since it's strictly less efficient
> than replicated to avoid user mistakes.

If this is only for rbd, we might as well store k+m copies since there's
so little omap data.

I agree cephfs and rgw should continue to refuse to use EC for metadata,
since their omap use gets far too large and is in the data path.

Josh


* Re: Why does Erasure-pool not support omap?
  2017-10-26 16:21         ` Josh Durgin
@ 2017-10-26 17:32           ` Matt Benjamin
  2017-10-30 22:48           ` Gregory Farnum
  1 sibling, 0 replies; 15+ messages in thread
From: Matt Benjamin @ 2017-10-26 17:32 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Sage Weil, Gregory Farnum, Xie Xingguo, ceph-devel

On Thu, Oct 26, 2017 at 12:21 PM, Josh Durgin <jdurgin@redhat.com> wrote:
> On 10/26/2017 07:26 AM, Sage Weil wrote:
>>

>
> If this is only for rbd, we might as well store k+m copies since there's
> so little omap data.
>
> I agree cephfs and rgw continue to refuse to use EC for metadata, since
> their omap use gets far to large and is in the data path.

This highlights the problem, but isn't an answer to it.  What RGW
actually has is a split data path when its data is on EC.

Matt

-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


* Re: Why does Erasure-pool not support omap?
  2017-10-26 16:21         ` Josh Durgin
  2017-10-26 17:32           ` Matt Benjamin
@ 2017-10-30 22:48           ` Gregory Farnum
  2017-10-31  2:26             ` Sage Weil
  1 sibling, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2017-10-30 22:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: Xie Xingguo, ceph-devel, Josh Durgin

On Thu, Oct 26, 2017 at 9:21 AM Josh Durgin <jdurgin@redhat.com> wrote:
>
> On 10/26/2017 07:26 AM, Sage Weil wrote:
> > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
> >>> On 10/25/2017 05:16 AM, Sage Weil wrote:
> >>>>
> >>>> Hi Xingguo,
> >>>>
> >>>> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
> >>>>>
> >>>>>         I wonder why erasure-pools can not support omap currently.
> >>>>>
> >>>>>         The simplest way for erasure-pools to support omap I can figure
> >>>>> out would be duplicating omap on every shard.
> >>>>>
> >>>>>         It is because it consumes too much space when k + m gets bigger?
> >>>>
> >>>>
> >>>> Right.  There isn't a nontrivial way to actually erasure code it, and
> >>>> duplicating on every shard is inefficient.
> >>>>
> >>>> One reasonableish approach would be to replicate the omap data on m+1
> >>>> shards.  But it's a bit of work to implement and nobody has done it.
> >>>>
> >>>> I can't remember if there were concerns with this approach or it was just
> >>>> a matter of time/resources... Josh? Greg?
> >>>
> >>>
> >>> It restricts us to erasure codes like reed-solomon where a subset of shards
> >>> are always updated. I think this is a reasonable trade-off though, it's just
> >>> a matter of implementing it. We haven't written
> >>> up the required peering changes, but they did not seem too difficult to
> >>> implement.
> >>>
> >>> Some notes on the approach are here - just think of 'replicating omap'
> >>> as a partial write to m+1 shards:
> >>>
> >>> http://pad.ceph.com/p/ec-partial-writes
> >>
> >> Yeah. To expand a bit on why this only works for Reed-Solomon,
> >> consider the minimum and appropriate number of copies — and the actual
> >> shard placement — for local recovery codes. :/ We were unable to
> >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
> >>
> >> I'm also still nervous that this might do weird things to our recovery
> >> and availability patterns in more complex failure cases, but I don't
> >> have any concrete issues.
> >
> > It seems like the minimum-viable variation of this is that we don't change
> > any of the peering or logging behavior at all, but just send the omap
> > writes to all shards (like any other write), but only the annointed shards
> > persist.
> >
> > That leaves lots of room for improvement, but it makes the feature work
> > without many changes, and means we can drop the specialness around rbd
> > images in EC pools.
>
> Won't that still require recovery and read path changes?


I also don't understand at all how this would work. Can you expand, Sage?
-Greg


* Re: Why does Erasure-pool not support omap?
  2017-10-30 22:48           ` Gregory Farnum
@ 2017-10-31  2:26             ` Sage Weil
  2017-11-01 20:27               ` Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread
From: Sage Weil @ 2017-10-31  2:26 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Xie Xingguo, ceph-devel, Josh Durgin


On Mon, 30 Oct 2017, Gregory Farnum wrote:
> On Thu, Oct 26, 2017 at 9:21 AM Josh Durgin <jdurgin@redhat.com> wrote:
> >
> > On 10/26/2017 07:26 AM, Sage Weil wrote:
> > > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> > >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
> > >>> On 10/25/2017 05:16 AM, Sage Weil wrote:
> > >>>>
> > >>>> Hi Xingguo,
> > >>>>
> > >>>> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
> > >>>>>
> > >>>>>         I wonder why erasure-pools can not support omap currently.
> > >>>>>
> > >>>>>         The simplest way for erasure-pools to support omap I can figure
> > >>>>> out would be duplicating omap on every shard.
> > >>>>>
> > >>>>>         It is because it consumes too much space when k + m gets bigger?
> > >>>>
> > >>>>
> > >>>> Right.  There isn't a nontrivial way to actually erasure code it, and
> > >>>> duplicating on every shard is inefficient.
> > >>>>
> > >>>> One reasonableish approach would be to replicate the omap data on m+1
> > >>>> shards.  But it's a bit of work to implement and nobody has done it.
> > >>>>
> > >>>> I can't remember if there were concerns with this approach or it was just
> > >>>> a matter of time/resources... Josh? Greg?
> > >>>
> > >>>
> > >>> It restricts us to erasure codes like reed-solomon where a subset of shards
> > >>> are always updated. I think this is a reasonable trade-off though, it's just
> > >>> a matter of implementing it. We haven't written
> > >>> up the required peering changes, but they did not seem too difficult to
> > >>> implement.
> > >>>
> > >>> Some notes on the approach are here - just think of 'replicating omap'
> > >>> as a partial write to m+1 shards:
> > >>>
> > >>> http://pad.ceph.com/p/ec-partial-writes
> > >>
> > >> Yeah. To expand a bit on why this only works for Reed-Solomon,
> > >> consider the minimum and appropriate number of copies — and the actual
> > >> shard placement — for local recovery codes. :/ We were unable to
> > >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
> > >>
> > >> I'm also still nervous that this might do weird things to our recovery
> > >> and availability patterns in more complex failure cases, but I don't
> > >> have any concrete issues.
> > >
> > > It seems like the minimum-viable variation of this is that we don't change
> > > any of the peering or logging behavior at all, but just send the omap
> > > writes to all shards (like any other write), but only the annointed shards
> > > persist.
> > >
> > > That leaves lots of room for improvement, but it makes the feature work
> > > without many changes, and means we can drop the specialness around rbd
> > > images in EC pools.
> >
> > Won't that still require recovery and read path changes?
> 
> 
> I also don't understand at all how this would work. Can you expand, Sage?

On write, the ECTransaction collects the omap operation.  We either send 
it to all shards, or elide just the omap key/value data for shard_id > k.  
For shard_id <= k, we write the omap data to the local object.  We still 
send the write op to all shards, with attrs and pg log entries.

We take care to always select the first acting shard as the primary, which 
will ensure a shard_id <= k if we go active, such that cls operations and 
omap reads can be handled locally.

Hmm, I think the problem is with rollback, though.  IIRC the code is 
structured around rollback and not rollforward, and omap writes are blind.  

So, not trivial, but it doesn't require any of the stuff we were talking 
about before where we'd only send writes to a subset of shards and have 
incomplete pg logs on each shard.
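
As a sketch of that routing (ShardWrite and build_shard_writes are 
invented names for illustration, not the actual ECTransaction code; the 
shard-id convention follows the description above):

// Every shard still gets the op, its attrs and the pg log entry; only the
// omap key/value payload is elided for shard ids > k.
#include <vector>

struct ShardWrite {
  unsigned shard_id;
  bool has_omap_payload;
  // attrs, pg log entry, EC-coded data chunk, etc. elided
};

std::vector<ShardWrite> build_shard_writes(unsigned k, unsigned m,
                                           bool op_touches_omap) {
  std::vector<ShardWrite> writes;
  for (unsigned s = 0; s < k + m; ++s) {
    // omap data is kept for shards 0..k and dropped for the rest; attrs
    // and log entries go everywhere so peering and logging stay the same
    writes.push_back({s, op_touches_omap && s <= k});
  }
  return writes;
}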

sage


* Re: Why does Erasure-pool not support omap?
  2017-10-31  2:26             ` Sage Weil
@ 2017-11-01 20:27               ` Gregory Farnum
  0 siblings, 0 replies; 15+ messages in thread
From: Gregory Farnum @ 2017-11-01 20:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: Xie Xingguo, ceph-devel, Josh Durgin

On Mon, Oct 30, 2017 at 7:26 PM Sage Weil <sweil@redhat.com> wrote:
>
> On Mon, 30 Oct 2017, Gregory Farnum wrote:
> > On Thu, Oct 26, 2017 at 9:21 AM Josh Durgin <jdurgin@redhat.com> wrote:
> > >
> > > On 10/26/2017 07:26 AM, Sage Weil wrote:
> > > > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> > > >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
> > > >>> On 10/25/2017 05:16 AM, Sage Weil wrote:
> > > >>>>
> > > >>>> Hi Xingguo,
> > > >>>>
> > > >>>> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
> > > >>>>>
> > > >>>>>         I wonder why erasure-pools can not support omap currently.
> > > >>>>>
> > > >>>>>         The simplest way for erasure-pools to support omap I can figure
> > > >>>>> out would be duplicating omap on every shard.
> > > >>>>>
> > > >>>>>         It is because it consumes too much space when k + m gets bigger?
> > > >>>>
> > > >>>>
> > > >>>> Right.  There isn't a nontrivial way to actually erasure code it, and
> > > >>>> duplicating on every shard is inefficient.
> > > >>>>
> > > >>>> One reasonableish approach would be to replicate the omap data on m+1
> > > >>>> shards.  But it's a bit of work to implement and nobody has done it.
> > > >>>>
> > > >>>> I can't remember if there were concerns with this approach or it was just
> > > >>>> a matter of time/resources... Josh? Greg?
> > > >>>
> > > >>>
> > > >>> It restricts us to erasure codes like reed-solomon where a subset of shards
> > > >>> are always updated. I think this is a reasonable trade-off though, it's just
> > > >>> a matter of implementing it. We haven't written
> > > >>> up the required peering changes, but they did not seem too difficult to
> > > >>> implement.
> > > >>>
> > > >>> Some notes on the approach are here - just think of 'replicating omap'
> > > >>> as a partial write to m+1 shards:
> > > >>>
> > > >>> http://pad.ceph.com/p/ec-partial-writes
> > > >>
> > > >> Yeah. To expand a bit on why this only works for Reed-Solomon,
> > > >> consider the minimum and appropriate number of copies — and the actual
> > > >> shard placement — for local recovery codes. :/ We were unable to
> > > >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
> > > >>
> > > >> I'm also still nervous that this might do weird things to our recovery
> > > >> and availability patterns in more complex failure cases, but I don't
> > > >> have any concrete issues.
> > > >
> > > > It seems like the minimum-viable variation of this is that we don't change
> > > > any of the peering or logging behavior at all, but just send the omap
> > > > writes to all shards (like any other write), but only the annointed shards
> > > > persist.
> > > >
> > > > That leaves lots of room for improvement, but it makes the feature work
> > > > without many changes, and means we can drop the specialness around rbd
> > > > images in EC pools.
> > >
> > > Won't that still require recovery and read path changes?
> >
> >
> > I also don't understand at all how this would work. Can you expand, Sage?
>
> On write, the ECTransaction collects the omap operation.  We either send
> it to all shards or slide just the omap key/value data for shard_id > k.
> For shard_id <= k, we write the omap data to the local object.  We still
> send the write op to all shards with attrs and pg log entries.
>
> We take care to always select the first acting shard as the primary, which
> will ensure a shard_id <= k if we go active, such that cls operations and
> omap reads can be handled locally.
>
> Hmm, I think the problem is with rollback, though.  IIRC the code is
> structured around rollback and not rollforward, and omap writes are blind.
>
> So, not trivial, but it doesn't require any of the stuff we were talking
> about before where we'd only send writes to a subset of shards and have
> incomplete pg logs on each shard.


Okay, so by "don't change any of the peering or logging behavior at
all", you meant we didn't have to do any of the stuff that starts
accounting for differing pg versions on each shard's object. But of
course we still need to make a number of changes to the peering and
recovery code so we select the right shards.

We *could* do that, but it seems like another of the "sorta there"
features we end up regretting. I guess the apparent hurry to get it in
before other EC pool enhancements are ready is to avoid the rbd header
pool? How much effort does that actually save? (Even the minimal
peering+recovery changes here will take a fair bit of doing and a lot
of QA qualification.)
-Greg

