* Re: Why does Erasure-pool not support omap? [not found] <201710251652060421729@zte.com.cn> @ 2017-10-25 12:16 ` Sage Weil 2017-10-25 18:57 ` Josh Durgin 0 siblings, 1 reply; 15+ messages in thread From: Sage Weil @ 2017-10-25 12:16 UTC (permalink / raw) To: xie.xingguo; +Cc: ceph-devel, jdurgin, gfarnum Hi Xingguo, On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: > I wonder why erasure-pools cannot support omap currently. > > The simplest way for erasure-pools to support omap I can figure out would be duplicating omap on every shard. > > Is it because it consumes too much space when k + m gets bigger? Right. There isn't a nontrivial way to actually erasure code it, and duplicating on every shard is inefficient. One reasonableish approach would be to replicate the omap data on m+1 shards. But it's a bit of work to implement and nobody has done it. I can't remember if there were concerns with this approach or it was just a matter of time/resources... Josh? Greg? sage ^ permalink raw reply [flat|nested] 15+ messages in thread
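Why m+1 is the minimum number of omap copies can be checked mechanically: the pool must survive any m simultaneous shard failures, and at least one omap copy has to remain afterwards. A small illustrative sketch (the shard layout and function name are hypothetical, not Ceph code):

```python
from itertools import combinations

def omap_survives(k, m, copies):
    """With omap replicated on `copies` of the k+m shards, check that every
    possible set of m simultaneous shard failures leaves at least one
    surviving omap copy."""
    shards = range(k + m)
    omap_shards = set(range(copies))          # a fixed subset of shards holds omap
    return all(omap_shards - set(failed)      # some omap-bearing shard survives
               for failed in combinations(shards, m))

# m+1 copies are the minimum: with only m copies, one unlucky failure set
# can take out every shard that holds the omap data.
assert omap_survives(k=4, m=2, copies=3)      # m+1 copies survive any 2 failures
assert not omap_survives(k=4, m=2, copies=2)  # m copies: a fatal failure set exists
```

The same check explains why duplicating on every shard (copies = k + m) is safe but wasteful, which is the trade-off Xingguo's question raises.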
* Re: Why does Erasure-pool not support omap? 2017-10-25 12:16 ` Why does Erasure-pool not support omap? Sage Weil @ 2017-10-25 18:57 ` Josh Durgin 2017-10-26 14:20 ` Gregory Farnum 0 siblings, 1 reply; 15+ messages in thread From: Josh Durgin @ 2017-10-25 18:57 UTC (permalink / raw) To: Sage Weil, xie.xingguo; +Cc: ceph-devel, gfarnum On 10/25/2017 05:16 AM, Sage Weil wrote: > Hi Xingguo, > > On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: >> I wonder why erasure-pools can not support omap currently. >> >> The simplest way for erasure-pools to support omap I can figure out would be duplicating omap on every shard. >> >> It is because it consumes too much space when k + m gets bigger? > > Right. There isn't a nontrivial way to actually erasure code it, and > duplicating on every shard is inefficient. > > One reasonableish approach would be to replicate the omap data on m+1 > shards. But it's a bit of work to implement and nobody has done it. > > I can't remember if there were concerns with this approach or it was just > a matter of time/resources... Josh? Greg? It restricts us to erasure codes like reed-solomon where a subset of shards are always updated. I think this is a reasonable trade-off though, it's just a matter of implementing it. We haven't written up the required peering changes, but they did not seem too difficult to implement. Some notes on the approach are here - just think of 'replicating omap' as a partial write to m+1 shards: http://pad.ceph.com/p/ec-partial-writes ^ permalink raw reply [flat|nested] 15+ messages in thread
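Josh's "partial write to m+1 shards" framing rests on a property of systematic codes like Reed-Solomon: updating any data shard also forces an update of every parity shard, so the set of shards touched by a sub-stripe write is always predictable. A sketch of that shard-set computation (the data/parity numbering convention is assumed for illustration):

```python
def shards_touched(k, m, changed_data_shards):
    """Under a systematic Reed-Solomon layout (data shards 0..k-1, parity
    shards k..k+m-1), a partial data write must also update all m parity
    shards, because each parity chunk depends on every data chunk."""
    parity_shards = set(range(k, k + m))
    return set(changed_data_shards) | parity_shards

# Updating a single data shard touches m+1 shards in total -- the same shape
# as 'replicate omap on m+1 shards', which is why the partial-write peering
# notes cover the omap case as well.
assert shards_touched(4, 2, {1}) == {1, 4, 5}
```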
* Re: Why does Erasure-pool not support omap? 2017-10-25 18:57 ` Josh Durgin @ 2017-10-26 14:20 ` Gregory Farnum 2017-10-26 14:26 ` Sage Weil 0 siblings, 1 reply; 15+ messages in thread From: Gregory Farnum @ 2017-10-26 14:20 UTC (permalink / raw) To: Josh Durgin; +Cc: Sage Weil, Xie Xingguo, ceph-devel On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote: > On 10/25/2017 05:16 AM, Sage Weil wrote: >> >> Hi Xingguo, >> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: >>> >>> I wonder why erasure-pools can not support omap currently. >>> >>> The simplest way for erasure-pools to support omap I can figure >>> out would be duplicating omap on every shard. >>> >>> It is because it consumes too much space when k + m gets bigger? >> >> >> Right. There isn't a nontrivial way to actually erasure code it, and >> duplicating on every shard is inefficient. >> >> One reasonableish approach would be to replicate the omap data on m+1 >> shards. But it's a bit of work to implement and nobody has done it. >> >> I can't remember if there were concerns with this approach or it was just >> a matter of time/resources... Josh? Greg? > > > It restricts us to erasure codes like reed-solomon where a subset of shards > are always updated. I think this is a reasonable trade-off though, it's just > a matter of implementing it. We haven't written > up the required peering changes, but they did not seem too difficult to > implement. > > Some notes on the approach are here - just think of 'replicating omap' > as a partial write to m+1 shards: > > http://pad.ceph.com/p/ec-partial-writes Yeah. To expand a bit on why this only works for Reed-Solomon, consider the minimum and appropriate number of copies — and the actual shard placement — for local recovery codes. :/ We were unable to generalize for that (or indeed for SHEC, IIRC) when whiteboarding. 
I'm also still nervous that this might do weird things to our recovery and availability patterns in more complex failure cases, but I don't have any concrete issues. -Greg ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Why does Erasure-pool not support omap? 2017-10-26 14:20 ` Gregory Farnum @ 2017-10-26 14:26 ` Sage Weil 2017-10-26 15:07 ` Matt Benjamin ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Sage Weil @ 2017-10-26 14:26 UTC (permalink / raw) To: Gregory Farnum; +Cc: Josh Durgin, Xie Xingguo, ceph-devel On Thu, 26 Oct 2017, Gregory Farnum wrote: > On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote: > > On 10/25/2017 05:16 AM, Sage Weil wrote: > >> > >> Hi Xingguo, > >> > >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: > >>> > >>> I wonder why erasure-pools can not support omap currently. > >>> > >>> The simplest way for erasure-pools to support omap I can figure > >>> out would be duplicating omap on every shard. > >>> > >>> It is because it consumes too much space when k + m gets bigger? > >> > >> > >> Right. There isn't a nontrivial way to actually erasure code it, and > >> duplicating on every shard is inefficient. > >> > >> One reasonableish approach would be to replicate the omap data on m+1 > >> shards. But it's a bit of work to implement and nobody has done it. > >> > >> I can't remember if there were concerns with this approach or it was just > >> a matter of time/resources... Josh? Greg? > > > > > > It restricts us to erasure codes like reed-solomon where a subset of shards > > are always updated. I think this is a reasonable trade-off though, it's just > > a matter of implementing it. We haven't written > > up the required peering changes, but they did not seem too difficult to > > implement. > > > > Some notes on the approach are here - just think of 'replicating omap' > > as a partial write to m+1 shards: > > > > http://pad.ceph.com/p/ec-partial-writes > > Yeah. To expand a bit on why this only works for Reed-Solomon, > consider the minimum and appropriate number of copies — and the actual > shard placement — for local recovery codes.
:/ We were unable to > generalize for that (or indeed for SHEC, IIRC) when whiteboarding. > > I'm also still nervous that this might do weird things to our recovery > and availability patterns in more complex failure cases, but I don't > have any concrete issues. It seems like the minimum-viable variation of this is that we don't change any of the peering or logging behavior at all, but just send the omap writes to all shards (like any other write), and only the anointed shards persist them. That leaves lots of room for improvement, but it makes the feature work without many changes, and means we can drop the specialness around rbd images in EC pools. Then we can make CephFS and RGW issue warnings (or even refuse) to use EC pools for their metadata or index pools, since it's strictly less efficient than replicated, to avoid user mistakes. ? sage ^ permalink raw reply [flat|nested] 15+ messages in thread
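The minimum-viable variation Sage describes can be sketched in a few lines (the `Shard` class and the choice of anointed shards are assumptions for illustration, not OSD code): every shard receives and logs the omap write exactly as it would any other op, so peering and logging are untouched, but only the designated m+1 shards actually store the keys.

```python
class Shard:
    def __init__(self, shard_id, persists_omap):
        self.shard_id = shard_id
        self.persists_omap = persists_omap
        self.log = []        # every shard logs the op (peering unchanged)
        self.omap = {}       # only anointed shards actually store keys

    def apply_omap_write(self, version, updates):
        self.log.append(version)          # logging behavior is unchanged
        if self.persists_omap:
            self.omap.update(updates)     # persist only on anointed shards

k, m = 4, 2
# Anoint m+1 shards (here simply the first m+1) to hold the omap copies.
shards = [Shard(i, persists_omap=(i <= m)) for i in range(k + m)]
for s in shards:                          # the primary fans the write out to all
    s.apply_omap_write(1, {"snap_seq": "42"})

assert all(s.log == [1] for s in shards)              # everyone logged the op
assert sum(bool(s.omap) for s in shards) == m + 1     # m+1 copies persisted
```

The appeal is that divergence between shards is confined to the omap store itself; log-based recovery decisions look identical on every shard.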
* Re: Why does Erasure-pool not support omap? 2017-10-26 14:26 ` Sage Weil @ 2017-10-26 15:07 ` Matt Benjamin 2017-10-26 15:08 ` Jason Dillaman 2017-10-26 16:21 ` Josh Durgin 2 siblings, 0 replies; 15+ messages in thread From: Matt Benjamin @ 2017-10-26 15:07 UTC (permalink / raw) To: Sage Weil; +Cc: Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel That sounds like a promising way forward, to me. Matt On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote: > On Thu, 26 Oct 2017, Gregory Farnum wrote: >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote: >> > On 10/25/2017 05:16 AM, Sage Weil wrote: >> >> >> >> Hi Xingguo, >> >> >> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: >> >>> >> >>> I wonder why erasure-pools can not support omap currently. >> >>> >> >>> The simplest way for erasure-pools to support omap I can figure >> >>> out would be duplicating omap on every shard. >> >>> >> >>> It is because it consumes too much space when k + m gets bigger? >> >> >> >> >> >> Right. There isn't a nontrivial way to actually erasure code it, and >> >> duplicating on every shard is inefficient. >> >> >> >> One reasonableish approach would be to replicate the omap data on m+1 >> >> shards. But it's a bit of work to implement and nobody has done it. >> >> >> >> I can't remember if there were concerns with this approach or it was just >> >> a matter of time/resources... Josh? Greg? >> > >> > >> > It restricts us to erasure codes like reed-solomon where a subset of shards >> > are always updated. I think this is a reasonable trade-off though, it's just >> > a matter of implementing it. We haven't written >> > up the required peering changes, but they did not seem too difficult to >> > implement. >> > >> > Some notes on the approach are here - just think of 'replicating omap' >> > as a partial write to m+1 shards: >> > >> > http://pad.ceph.com/p/ec-partial-writes >> >> Yeah. 
To expand a bit on why this only works for Reed-Solomon, >> consider the minimum and appropriate number of copies — and the actual >> shard placement — for local recovery codes. :/ We were unable to >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding. >> >> I'm also still nervous that this might do weird things to our recovery >> and availability patterns in more complex failure cases, but I don't >> have any concrete issues. > > It seems like the minimum-viable variation of this is that we don't change > any of the peering or logging behavior at all, but just send the omap > writes to all shards (like any other write), but only the annointed shards > persist. > > That leaves lots of room for improvement, but it makes the feature work > without many changes, and means we can drop the specialness around rbd > images in EC pools. > > Then we can make CephFS and RGW issue warnings (or even refuse) to use EC > pools for their metadata or index pools since it's strictly less efficient > than replicated to avoid user mistakes. > > ? > > sage -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Why does Erasure-pool not support omap? 2017-10-26 14:26 ` Sage Weil 2017-10-26 15:07 ` Matt Benjamin @ 2017-10-26 15:08 ` Jason Dillaman 2017-10-26 15:35 ` Matt Benjamin 2017-10-26 16:10 ` Sage Weil 2017-10-26 16:21 ` Josh Durgin 2 siblings, 2 replies; 15+ messages in thread From: Jason Dillaman @ 2017-10-26 15:08 UTC (permalink / raw) To: Sage Weil; +Cc: Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote: > On Thu, 26 Oct 2017, Gregory Farnum wrote: >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote: >> > On 10/25/2017 05:16 AM, Sage Weil wrote: >> >> >> >> Hi Xingguo, >> >> >> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: >> >>> >> >>> I wonder why erasure-pools can not support omap currently. >> >>> >> >>> The simplest way for erasure-pools to support omap I can figure >> >>> out would be duplicating omap on every shard. >> >>> >> >>> It is because it consumes too much space when k + m gets bigger? >> >> >> >> >> >> Right. There isn't a nontrivial way to actually erasure code it, and >> >> duplicating on every shard is inefficient. >> >> >> >> One reasonableish approach would be to replicate the omap data on m+1 >> >> shards. But it's a bit of work to implement and nobody has done it. >> >> >> >> I can't remember if there were concerns with this approach or it was just >> >> a matter of time/resources... Josh? Greg? >> > >> > >> > It restricts us to erasure codes like reed-solomon where a subset of shards >> > are always updated. I think this is a reasonable trade-off though, it's just >> > a matter of implementing it. We haven't written >> > up the required peering changes, but they did not seem too difficult to >> > implement. >> > >> > Some notes on the approach are here - just think of 'replicating omap' >> > as a partial write to m+1 shards: >> > >> > http://pad.ceph.com/p/ec-partial-writes >> >> Yeah. 
To expand a bit on why this only works for Reed-Solomon, >> consider the minimum and appropriate number of copies — and the actual >> shard placement — for local recovery codes. :/ We were unable to >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding. >> >> I'm also still nervous that this might do weird things to our recovery >> and availability patterns in more complex failure cases, but I don't >> have any concrete issues. > > It seems like the minimum-viable variation of this is that we don't change > any of the peering or logging behavior at all, but just send the omap > writes to all shards (like any other write), but only the annointed shards > persist. > > That leaves lots of room for improvement, but it makes the feature work > without many changes, and means we can drop the specialness around rbd > images in EC pools. Potentially negative since RBD relies heavily on class methods. Assuming the cls_cxx_map_XYZ operations will never require async work, there is still the issue with methods that perform straight read/write calls. > Then we can make CephFS and RGW issue warnings (or even refuse) to use EC > pools for their metadata or index pools since it's strictly less efficient > than replicated to avoid user mistakes. > > ? > > sage -- Jason ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Why does Erasure-pool not support omap? 2017-10-26 15:08 ` Jason Dillaman @ 2017-10-26 15:35 ` Matt Benjamin 2017-10-26 15:49 ` Jason Dillaman 2017-10-26 16:10 ` Sage Weil 1 sibling, 1 reply; 15+ messages in thread From: Matt Benjamin @ 2017-10-26 15:35 UTC (permalink / raw) To: Dillaman, Jason Cc: Sage Weil, Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel I had the understanding that RGW's use of class methods, which is also extensive, would be compatible with this approach. Is there reason to doubt that? Matt On Thu, Oct 26, 2017 at 11:08 AM, Jason Dillaman <jdillama@redhat.com> wrote: > On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote: >> On Thu, 26 Oct 2017, Gregory Farnum wrote: >>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote: >>> > On 10/25/2017 05:16 AM, Sage Weil wrote: >>> >> >>> >> Hi Xingguo, >>> >> >>> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: >>> >>> >>> >>> I wonder why erasure-pools can not support omap currently. >>> >>> >>> >>> The simplest way for erasure-pools to support omap I can figure >>> >>> out would be duplicating omap on every shard. >>> >>> >>> >>> It is because it consumes too much space when k + m gets bigger? >>> >> >>> >> >>> >> Right. There isn't a nontrivial way to actually erasure code it, and >>> >> duplicating on every shard is inefficient. >>> >> >>> >> One reasonableish approach would be to replicate the omap data on m+1 >>> >> shards. But it's a bit of work to implement and nobody has done it. >>> >> >>> >> I can't remember if there were concerns with this approach or it was just >>> >> a matter of time/resources... Josh? Greg? >>> > >>> > >>> > It restricts us to erasure codes like reed-solomon where a subset of shards >>> > are always updated. I think this is a reasonable trade-off though, it's just >>> > a matter of implementing it. We haven't written >>> > up the required peering changes, but they did not seem too difficult to >>> > implement. 
>>> > >>> > Some notes on the approach are here - just think of 'replicating omap' >>> > as a partial write to m+1 shards: >>> > >>> > http://pad.ceph.com/p/ec-partial-writes >>> >>> Yeah. To expand a bit on why this only works for Reed-Solomon, >>> consider the minimum and appropriate number of copies — and the actual >>> shard placement — for local recovery codes. :/ We were unable to >>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding. >>> >>> I'm also still nervous that this might do weird things to our recovery >>> and availability patterns in more complex failure cases, but I don't >>> have any concrete issues. >> >> It seems like the minimum-viable variation of this is that we don't change >> any of the peering or logging behavior at all, but just send the omap >> writes to all shards (like any other write), but only the annointed shards >> persist. >> >> That leaves lots of room for improvement, but it makes the feature work >> without many changes, and means we can drop the specialness around rbd >> images in EC pools. > > Potentially negative since RBD relies heavily on class methods. > Assuming the cls_cxx_map_XYZ operations will never require async work, > there is still the issue with methods that perform straight read/write > calls. > >> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC >> pools for their metadata or index pools since it's strictly less efficient >> than replicated to avoid user mistakes. >> >> ? >> >> sage > > > > -- > Jason > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Why does Erasure-pool not support omap? 2017-10-26 15:35 ` Matt Benjamin @ 2017-10-26 15:49 ` Jason Dillaman 2017-10-26 15:50 ` Matt Benjamin 0 siblings, 1 reply; 15+ messages in thread From: Jason Dillaman @ 2017-10-26 15:49 UTC (permalink / raw) To: Matt Benjamin Cc: Sage Weil, Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel On Thu, Oct 26, 2017 at 11:35 AM, Matt Benjamin <mbenjami@redhat.com> wrote: > I had the understanding that RGW's use of class methods, which is also > extensive, would be compatible with this approach. Is there reason to > doubt that? I don't see any "cls_cxx_read" calls in RGW's class methods. Like I said, assuming the omap class object calls remain synchronous on an EC-backed pool, omap won't be an issue for RBD but reads will be an issue (v1 directory, RBD image id objects, and the object map). > Matt > > On Thu, Oct 26, 2017 at 11:08 AM, Jason Dillaman <jdillama@redhat.com> wrote: >> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote: >>> On Thu, 26 Oct 2017, Gregory Farnum wrote: >>>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote: >>>> > On 10/25/2017 05:16 AM, Sage Weil wrote: >>>> >> >>>> >> Hi Xingguo, >>>> >> >>>> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: >>>> >>> >>>> >>> I wonder why erasure-pools can not support omap currently. >>>> >>> >>>> >>> The simplest way for erasure-pools to support omap I can figure >>>> >>> out would be duplicating omap on every shard. >>>> >>> >>>> >>> It is because it consumes too much space when k + m gets bigger? >>>> >> >>>> >> >>>> >> Right. There isn't a nontrivial way to actually erasure code it, and >>>> >> duplicating on every shard is inefficient. >>>> >> >>>> >> One reasonableish approach would be to replicate the omap data on m+1 >>>> >> shards. But it's a bit of work to implement and nobody has done it. 
>>>> >> >>>> >> I can't remember if there were concerns with this approach or it was just >>>> >> a matter of time/resources... Josh? Greg? >>>> > >>>> > >>>> > It restricts us to erasure codes like reed-solomon where a subset of shards >>>> > are always updated. I think this is a reasonable trade-off though, it's just >>>> > a matter of implementing it. We haven't written >>>> > up the required peering changes, but they did not seem too difficult to >>>> > implement. >>>> > >>>> > Some notes on the approach are here - just think of 'replicating omap' >>>> > as a partial write to m+1 shards: >>>> > >>>> > http://pad.ceph.com/p/ec-partial-writes >>>> >>>> Yeah. To expand a bit on why this only works for Reed-Solomon, >>>> consider the minimum and appropriate number of copies — and the actual >>>> shard placement — for local recovery codes. :/ We were unable to >>>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding. >>>> >>>> I'm also still nervous that this might do weird things to our recovery >>>> and availability patterns in more complex failure cases, but I don't >>>> have any concrete issues. >>> >>> It seems like the minimum-viable variation of this is that we don't change >>> any of the peering or logging behavior at all, but just send the omap >>> writes to all shards (like any other write), but only the annointed shards >>> persist. >>> >>> That leaves lots of room for improvement, but it makes the feature work >>> without many changes, and means we can drop the specialness around rbd >>> images in EC pools. >> >> Potentially negative since RBD relies heavily on class methods. >> Assuming the cls_cxx_map_XYZ operations will never require async work, >> there is still the issue with methods that perform straight read/write >> calls. >> >>> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC >>> pools for their metadata or index pools since it's strictly less efficient >>> than replicated to avoid user mistakes. >>> >>> ? 
>>> >>> sage >> >> >> >> -- >> Jason >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > > Matt Benjamin > Red Hat, Inc. > 315 West Huron Street, Suite 140A > Ann Arbor, Michigan 48103 > > http://www.redhat.com/en/technologies/storage > > tel. 734-821-5101 > fax. 734-769-8938 > cel. 734-216-5309 -- Jason ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Why does Erasure-pool not support omap? 2017-10-26 15:49 ` Jason Dillaman @ 2017-10-26 15:50 ` Matt Benjamin 0 siblings, 0 replies; 15+ messages in thread From: Matt Benjamin @ 2017-10-26 15:50 UTC (permalink / raw) To: Dillaman, Jason Cc: Sage Weil, Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel thanks for the explanation, Jason Matt On Thu, Oct 26, 2017 at 11:49 AM, Jason Dillaman <jdillama@redhat.com> wrote: > On Thu, Oct 26, 2017 at 11:35 AM, Matt Benjamin <mbenjami@redhat.com> wrote: >> I had the understanding that RGW's use of class methods, which is also >> extensive, would be compatible with this approach. Is there reason to >> doubt that? > > I don't see any "cls_cxx_read" calls in RGW's class methods. Like I > said, assuming the omap class object calls remain synchronous on an > EC-backed pool, omap won't be an issue for RBD but reads will be an > issue (v1 directory, RBD image id objects, and the object map). > >> Matt >> >> On Thu, Oct 26, 2017 at 11:08 AM, Jason Dillaman <jdillama@redhat.com> wrote: >>> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote: >>>> On Thu, 26 Oct 2017, Gregory Farnum wrote: >>>>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote: >>>>> > On 10/25/2017 05:16 AM, Sage Weil wrote: >>>>> >> >>>>> >> Hi Xingguo, >>>>> >> >>>>> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: >>>>> >>> >>>>> >>> I wonder why erasure-pools can not support omap currently. >>>>> >>> >>>>> >>> The simplest way for erasure-pools to support omap I can figure >>>>> >>> out would be duplicating omap on every shard. >>>>> >>> >>>>> >>> It is because it consumes too much space when k + m gets bigger? >>>>> >> >>>>> >> >>>>> >> Right. There isn't a nontrivial way to actually erasure code it, and >>>>> >> duplicating on every shard is inefficient. >>>>> >> >>>>> >> One reasonableish approach would be to replicate the omap data on m+1 >>>>> >> shards. 
But it's a bit of work to implement and nobody has done it. >>>>> >> >>>>> >> I can't remember if there were concerns with this approach or it was just >>>>> >> a matter of time/resources... Josh? Greg? >>>>> > >>>>> > >>>>> > It restricts us to erasure codes like reed-solomon where a subset of shards >>>>> > are always updated. I think this is a reasonable trade-off though, it's just >>>>> > a matter of implementing it. We haven't written >>>>> > up the required peering changes, but they did not seem too difficult to >>>>> > implement. >>>>> > >>>>> > Some notes on the approach are here - just think of 'replicating omap' >>>>> > as a partial write to m+1 shards: >>>>> > >>>>> > http://pad.ceph.com/p/ec-partial-writes >>>>> >>>>> Yeah. To expand a bit on why this only works for Reed-Solomon, >>>>> consider the minimum and appropriate number of copies — and the actual >>>>> shard placement — for local recovery codes. :/ We were unable to >>>>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding. >>>>> >>>>> I'm also still nervous that this might do weird things to our recovery >>>>> and availability patterns in more complex failure cases, but I don't >>>>> have any concrete issues. >>>> >>>> It seems like the minimum-viable variation of this is that we don't change >>>> any of the peering or logging behavior at all, but just send the omap >>>> writes to all shards (like any other write), but only the annointed shards >>>> persist. >>>> >>>> That leaves lots of room for improvement, but it makes the feature work >>>> without many changes, and means we can drop the specialness around rbd >>>> images in EC pools. >>> >>> Potentially negative since RBD relies heavily on class methods. >>> Assuming the cls_cxx_map_XYZ operations will never require async work, >>> there is still the issue with methods that perform straight read/write >>> calls. 
>>> >>>> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC >>>> pools for their metadata or index pools since it's strictly less efficient >>>> than replicated to avoid user mistakes. >>>> >>>> ? >>>> >>>> sage >>> >>> >>> >>> -- >>> Jason >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> >> -- >> >> Matt Benjamin >> Red Hat, Inc. >> 315 West Huron Street, Suite 140A >> Ann Arbor, Michigan 48103 >> >> http://www.redhat.com/en/technologies/storage >> >> tel. 734-821-5101 >> fax. 734-769-8938 >> cel. 734-216-5309 > > > > -- > Jason -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Why does Erasure-pool not support omap? 2017-10-26 15:08 ` Jason Dillaman 2017-10-26 15:35 ` Matt Benjamin @ 2017-10-26 16:10 ` Sage Weil 1 sibling, 0 replies; 15+ messages in thread From: Sage Weil @ 2017-10-26 16:10 UTC (permalink / raw) To: dillaman; +Cc: Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel On Thu, 26 Oct 2017, Jason Dillaman wrote: > On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote: > > On Thu, 26 Oct 2017, Gregory Farnum wrote: > >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote: > >> > On 10/25/2017 05:16 AM, Sage Weil wrote: > >> >> > >> >> Hi Xingguo, > >> >> > >> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote: > >> >>> > >> >>> I wonder why erasure-pools can not support omap currently. > >> >>> > >> >>> The simplest way for erasure-pools to support omap I can figure > >> >>> out would be duplicating omap on every shard. > >> >>> > >> >>> It is because it consumes too much space when k + m gets bigger? > >> >> > >> >> > >> >> Right. There isn't a nontrivial way to actually erasure code it, and > >> >> duplicating on every shard is inefficient. > >> >> > >> >> One reasonableish approach would be to replicate the omap data on m+1 > >> >> shards. But it's a bit of work to implement and nobody has done it. > >> >> > >> >> I can't remember if there were concerns with this approach or it was just > >> >> a matter of time/resources... Josh? Greg? > >> > > >> > > >> > It restricts us to erasure codes like reed-solomon where a subset of shards > >> > are always updated. I think this is a reasonable trade-off though, it's just > >> > a matter of implementing it. We haven't written > >> > up the required peering changes, but they did not seem too difficult to > >> > implement.
> >> > > >> > Some notes on the approach are here - just think of 'replicating omap' > >> > as a partial write to m+1 shards: > >> > > >> > http://pad.ceph.com/p/ec-partial-writes > >> > >> Yeah. To expand a bit on why this only works for Reed-Solomon, > >> consider the minimum and appropriate number of copies — and the actual > >> shard placement — for local recovery codes. :/ We were unable to > >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding. > >> > >> I'm also still nervous that this might do weird things to our recovery > >> and availability patterns in more complex failure cases, but I don't > >> have any concrete issues. > > > > It seems like the minimum-viable variation of this is that we don't change > > any of the peering or logging behavior at all, but just send the omap > > writes to all shards (like any other write), but only the annointed shards > > persist. > > > > That leaves lots of room for improvement, but it makes the feature work > > without many changes, and means we can drop the specialness around rbd > > images in EC pools. > > Potentially negative since RBD relies heavily on class methods. > Assuming the cls_cxx_map_XYZ operations will never require async work, > there is still the issue with methods that perform straight read/write > calls. Ooooh right, I remember now. This can be avoided most of the time by making the +1 of k+1 be the (normal) primary, but in a degraded situation the primary might be a shard that doesn't have a copy of the omap at all, in which case a simple omap read would need to be async. ...and perhaps this too can be avoided by making the primary role always be one of the k+1 shards that has a copy of the omap data. I think cls_rgw is in the same boat as cls_rbd, in that it is attr and omap only and doesn't use the object data payload of index objects? sage ^ permalink raw reply [flat|nested] 15+ messages in thread
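Sage's last point amounts to a primary-selection rule: when choosing the acting primary, prefer a shard that holds an omap copy, so omap reads and omap-only cls calls stay local and synchronous. A hypothetical helper (not actual peering code) showing the invariant:

```python
def choose_primary(up_shards, omap_shards):
    """Prefer an omap-bearing shard as primary so omap reads stay local and
    synchronous. With omap on m+1 shards and at most m failures, such a
    shard always exists; returning None means more than m omap-bearing
    shards are down and the PG has lost data regardless."""
    for s in up_shards:
        if s in omap_shards:
            return s
    return None

omap_shards = {0, 1, 2}              # the m+1 anointed shards (assumed layout)
assert choose_primary([3, 1, 4, 5], omap_shards) == 1   # degraded: still sync
assert choose_primary([0, 3, 4, 5], omap_shards) == 0
```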
* Re: Why does Erasure-pool not support omap?
  2017-10-26 14:26 ` Sage Weil
  2017-10-26 15:07   ` Matt Benjamin
  2017-10-26 15:08   ` Jason Dillaman
@ 2017-10-26 16:21 ` Josh Durgin
  2017-10-26 17:32   ` Matt Benjamin
  2017-10-30 22:48   ` Gregory Farnum
  2 siblings, 2 replies; 15+ messages in thread

From: Josh Durgin @ 2017-10-26 16:21 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: Xie Xingguo, ceph-devel

On 10/26/2017 07:26 AM, Sage Weil wrote:
> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>>> [...]
>>> Some notes on the approach are here - just think of 'replicating omap'
>>> as a partial write to m+1 shards:
>>>
>>> http://pad.ceph.com/p/ec-partial-writes
>>
>> Yeah.  To expand a bit on why this only works for Reed-Solomon,
>> consider the minimum and appropriate number of copies — and the actual
>> shard placement — for local recovery codes. :/  We were unable to
>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>
>> I'm also still nervous that this might do weird things to our recovery
>> and availability patterns in more complex failure cases, but I don't
>> have any concrete issues.
>
> It seems like the minimum-viable variation of this is that we don't change
> any of the peering or logging behavior at all, but just send the omap
> writes to all shards (like any other write), but only the anointed shards
> persist.
>
> That leaves lots of room for improvement, but it makes the feature work
> without many changes, and means we can drop the specialness around rbd
> images in EC pools.

Won't that still require recovery and read path changes?

> Then we can make CephFS and RGW issue warnings (or even refuse) to use EC
> pools for their metadata or index pools since it's strictly less efficient
> than replicated to avoid user mistakes.

If this is only for rbd, we might as well store k+m copies since there's
so little omap data.

I agree cephfs and rgw continue to refuse to use EC for metadata, since
their omap use gets far too large and is in the data path.

Josh

^ permalink raw reply	[flat|nested] 15+ messages in thread
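[Editorial note: the space trade-off Josh mentions here, storing k+m omap copies versus replicating on only m+1 shards, is easy to quantify. A back-of-the-envelope sketch (hypothetical helper, not a Ceph API):]

```python
def omap_copies(k, m, scheme):
    """Number of omap copies stored per object under each scheme.

    'all' - duplicate the omap data on every shard (k + m copies),
            the simple approach Xingguo proposed.
    'm+1' - replicate the omap data on m + 1 shards, the minimum
            that still leaves at least one copy after any m shard
            failures, matching the pool's durability for data.
    """
    if scheme == "all":
        return k + m
    if scheme == "m+1":
        return m + 1
    raise ValueError("unknown scheme: %s" % scheme)

# For a small 2+2 code the two schemes differ by one copy, which is
# why k+m copies is tolerable for rbd's tiny headers; for a wide 8+3
# code it is 11 copies versus 4, which is why per-shard duplication
# "consumes too much space when k + m gets bigger".
```

This is why the thread treats full duplication as acceptable only for workloads with tiny omap payloads (rbd headers) and not for cephfs/rgw metadata.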
* Re: Why does Erasure-pool not support omap?
  2017-10-26 16:21 ` Josh Durgin
@ 2017-10-26 17:32   ` Matt Benjamin
  0 siblings, 0 replies; 15+ messages in thread

From: Matt Benjamin @ 2017-10-26 17:32 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Sage Weil, Gregory Farnum, Xie Xingguo, ceph-devel

On Thu, Oct 26, 2017 at 12:21 PM, Josh Durgin <jdurgin@redhat.com> wrote:
> On 10/26/2017 07:26 AM, Sage Weil wrote:
> [...]
>
> If this is only for rbd, we might as well store k+m copies since there's
> so little omap data.
>
> I agree cephfs and rgw continue to refuse to use EC for metadata, since
> their omap use gets far too large and is in the data path.

This highlights, but isn't an answer to, the problem.  What RGW actually
has is a split data path, when data is on EC.

Matt

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: Why does Erasure-pool not support omap?
  2017-10-26 16:21 ` Josh Durgin
  2017-10-26 17:32   ` Matt Benjamin
@ 2017-10-30 22:48   ` Gregory Farnum
  2017-10-31  2:26     ` Sage Weil
  1 sibling, 1 reply; 15+ messages in thread

From: Gregory Farnum @ 2017-10-30 22:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: Xie Xingguo, ceph-devel, Josh Durgin

On Thu, Oct 26, 2017 at 9:21 AM Josh Durgin <jdurgin@redhat.com> wrote:
> On 10/26/2017 07:26 AM, Sage Weil wrote:
> > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> > [...]
> >
> > It seems like the minimum-viable variation of this is that we don't change
> > any of the peering or logging behavior at all, but just send the omap
> > writes to all shards (like any other write), but only the anointed shards
> > persist.
> >
> > That leaves lots of room for improvement, but it makes the feature work
> > without many changes, and means we can drop the specialness around rbd
> > images in EC pools.
>
> Won't that still require recovery and read path changes?

I also don't understand at all how this would work. Can you expand, Sage?
-Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: Why does Erasure-pool not support omap?
  2017-10-30 22:48 ` Gregory Farnum
@ 2017-10-31  2:26   ` Sage Weil
  2017-11-01 20:27     ` Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread

From: Sage Weil @ 2017-10-31 2:26 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Xie Xingguo, ceph-devel, Josh Durgin

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3655 bytes --]

On Mon, 30 Oct 2017, Gregory Farnum wrote:
> On Thu, Oct 26, 2017 at 9:21 AM Josh Durgin <jdurgin@redhat.com> wrote:
> > On 10/26/2017 07:26 AM, Sage Weil wrote:
> > > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> > > [...]
> > >
> > > It seems like the minimum-viable variation of this is that we don't change
> > > any of the peering or logging behavior at all, but just send the omap
> > > writes to all shards (like any other write), but only the anointed shards
> > > persist.
> > >
> > > That leaves lots of room for improvement, but it makes the feature work
> > > without many changes, and means we can drop the specialness around rbd
> > > images in EC pools.
> >
> > Won't that still require recovery and read path changes?
>
> I also don't understand at all how this would work. Can you expand, Sage?

On write, the ECTransaction collects the omap operation.  We either send
it to all shards or elide just the omap key/value data for shard_id > k.
For shard_id <= k, we write the omap data to the local object.  We still
send the write op to all shards with attrs and pg log entries.

We take care to always select the first acting shard as the primary, which
will ensure a shard_id <= k if we go active, such that cls operations and
omap reads can be handled locally.

Hmm, I think the problem is with rollback, though.  IIRC the code is
structured around rollback and not rollforward, and omap writes are blind.

So, not trivial, but it doesn't require any of the stuff we were talking
about before where we'd only send writes to a subset of shards and have
incomplete pg logs on each shard.

sage

^ permalink raw reply	[flat|nested] 15+ messages in thread
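[Editorial note: the write fan-out Sage describes can be sketched as follows. This is a hypothetical Python illustration, not the real ECTransaction code; the message structure is an assumption. Every shard receives the attrs and pg log entry, so peering and logging are unchanged, while the omap key/value payload is elided for shards beyond the k+1 that store it.]

```python
def build_shard_messages(k, m, attrs, log_entry, omap_updates):
    """Build per-shard write messages for one EC transaction.

    All k+m shards get the attrs and the pg log entry; only shards
    0..k (the k+1 omap-bearing shards) also carry the omap key/value
    data, which they persist to the local object.
    """
    messages = {}
    for shard_id in range(k + m):
        msg = {"attrs": attrs, "log": log_entry}
        if shard_id <= k:
            msg["omap"] = omap_updates  # omap replicated on k+1 shards
        messages[shard_id] = msg
    return messages

# With a 2+2 code, shards 0-2 carry the omap payload and shard 3 gets
# only attrs and the log entry:
#   build_shard_messages(2, 2, {"a": 1}, "entry", {"key": "val"})
```

Note this sketch only shows the write path; as the follow-up discussion makes clear, the hard parts are rollback (omap writes are blind) and keeping the primary on an omap-bearing shard through peering and recovery.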
* Re: Why does Erasure-pool not support omap?
  2017-10-31  2:26 ` Sage Weil
@ 2017-11-01 20:27   ` Gregory Farnum
  0 siblings, 0 replies; 15+ messages in thread

From: Gregory Farnum @ 2017-11-01 20:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: Xie Xingguo, ceph-devel, Josh Durgin

On Mon, Oct 30, 2017 at 7:26 PM Sage Weil <sweil@redhat.com> wrote:
> On Mon, 30 Oct 2017, Gregory Farnum wrote:
> > [...]
>
> On write, the ECTransaction collects the omap operation.  We either send
> it to all shards or elide just the omap key/value data for shard_id > k.
> For shard_id <= k, we write the omap data to the local object.  We still
> send the write op to all shards with attrs and pg log entries.
>
> We take care to always select the first acting shard as the primary, which
> will ensure a shard_id <= k if we go active, such that cls operations and
> omap reads can be handled locally.
>
> Hmm, I think the problem is with rollback, though.  IIRC the code is
> structured around rollback and not rollforward, and omap writes are blind.
>
> So, not trivial, but it doesn't require any of the stuff we were talking
> about before where we'd only send writes to a subset of shards and have
> incomplete pg logs on each shard.

Okay, so by "don't change any of the peering or logging behavior at all",
you meant we didn't have to do any of the stuff that starts accounting for
differing pg versions on each shard's object.  But of course we still need
to make a number of changes to the peering and recovery code so we select
the right shards.

We *could* do that, but it seems like another of the "sorta there"
features we end up regretting.  I guess the apparent hurry to get it in
before other EC pool enhancements are ready is to avoid the rbd header
pool?  How much effort does that actually save?  (Even the minimal
peering+recovery changes here will take a fair bit of doing and a lot of
QA qualification.)
-Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2017-11-01 20:27 UTC | newest]

Thread overview: 15+ messages
     [not found] <201710251652060421729@zte.com.cn>
2017-10-25 12:16 ` Why does Erasure-pool not support omap? Sage Weil
2017-10-25 18:57   ` Josh Durgin
2017-10-26 14:20     ` Gregory Farnum
2017-10-26 14:26       ` Sage Weil
2017-10-26 15:07         ` Matt Benjamin
2017-10-26 15:08         ` Jason Dillaman
2017-10-26 15:35           ` Matt Benjamin
2017-10-26 15:49             ` Jason Dillaman
2017-10-26 15:50               ` Matt Benjamin
2017-10-26 16:10           ` Sage Weil
2017-10-26 16:21         ` Josh Durgin
2017-10-26 17:32           ` Matt Benjamin
2017-10-30 22:48           ` Gregory Farnum
2017-10-31  2:26             ` Sage Weil
2017-11-01 20:27               ` Gregory Farnum