From: Sage Weil
Subject: Re: Why does Erasure-pool not support omap?
Date: Thu, 26 Oct 2017 16:10:38 +0000 (UTC)
To: dillaman@redhat.com
Cc: Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel

On Thu, 26 Oct 2017, Jason Dillaman wrote:
> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil wrote:
> > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin wrote:
> >> > On 10/25/2017 05:16 AM, Sage Weil wrote:
> >> >>
> >> >> Hi Xingguo,
> >> >>
> >> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
> >> >>>
> >> >>> I wonder why erasure-pools cannot support omap currently.
> >> >>>
> >> >>> The simplest way for erasure-pools to support omap that I can
> >> >>> figure out would be duplicating the omap on every shard.
> >> >>>
> >> >>> Is it because that consumes too much space when k + m gets bigger?
> >> >>
> >> >> Right.  There isn't a nontrivial way to actually erasure-code it, and
> >> >> duplicating it on every shard is inefficient.
> >> >>
> >> >> One reasonable-ish approach would be to replicate the omap data on
> >> >> m+1 shards.  But it's a bit of work to implement and nobody has done
> >> >> it.
> >> >>
> >> >> I can't remember whether there were concerns with this approach or it
> >> >> was just a matter of time/resources... Josh?  Greg?
> >> >
> >> > It restricts us to erasure codes like Reed-Solomon, where a subset of
> >> > shards is always updated.  I think this is a reasonable trade-off,
> >> > though; it's just a matter of implementing it.  We haven't written up
> >> > the required peering changes, but they did not seem too difficult to
> >> > implement.
> >> >
> >> > Some notes on the approach are here - just think of 'replicating omap'
> >> > as a partial write to m+1 shards:
> >> >
> >> > http://pad.ceph.com/p/ec-partial-writes
> >>
> >> Yeah.  To expand a bit on why this only works for Reed-Solomon,
> >> consider the minimum and appropriate number of copies (and the actual
> >> shard placement) for local recovery codes. :/  We were unable to
> >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
> >>
> >> I'm also still nervous that this might do weird things to our recovery
> >> and availability patterns in more complex failure cases, but I don't
> >> have any concrete issues.
> >
> > It seems like the minimum-viable variation of this is that we don't
> > change any of the peering or logging behavior at all, but just send the
> > omap writes to all shards (like any other write), and only the anointed
> > shards persist them.
> >
> > That leaves lots of room for improvement, but it makes the feature work
> > without many changes, and means we can drop the specialness around rbd
> > images in EC pools.
>
> Potentially a negative, since RBD relies heavily on class methods.
> Assuming the cls_cxx_map_XYZ operations will never require async work,
> there is still the issue of methods that perform straight read/write
> calls.

Ooooh right, I remember now.  This can be avoided most of the time by
making the +1 of m+1 be the (normal) primary, but in a degraded situation
the primary might be a shard that doesn't have a copy of the omap at all,
in which case a simple omap read would need to be async.

...and perhaps this too can be avoided by making the primary role always
be one of the m+1 shards that have a copy of the omap data.

I think cls_rgw is in the same boat as cls_rbd, in that it is attr- and
omap-only and doesn't use the object data payload of the index objects?

sage
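
For reference, the kind of class method in question looks roughly like the
sketch below.  This is not cls_rbd or cls_rgw code; the class and method
names are made up and the objclass calls are written from memory of
objclass.h, so treat it as illustrative.  It sticks to omap operations
(the pattern that stays viable if omap is replicated on m+1 shards), and
the commented-out cls_cxx_read() marks the straight data-payload access
that would remain a problem on an EC pool.

// Toy object class in the style of Ceph's objclass API; illustrative only.
#include "objclass/objclass.h"

CLS_VER(1,0)
CLS_NAME(omap_demo)

static int omap_touch(cls_method_context_t hctx, bufferlist *in,
                      bufferlist *out)
{
  // omap read: stays a local, synchronous call only if the acting
  // primary is one of the m+1 shards holding the replicated omap.
  bufferlist cur;
  int r = cls_cxx_map_get_val(hctx, "demo_key", &cur);
  if (r < 0 && r != -ENOENT)
    return r;

  // omap write: under the scheme discussed above it would be sent to all
  // shards like any other write, with only the m+1 omap shards persisting it.
  bufferlist val;
  val.append("demo_value");   // placeholder payload
  r = cls_cxx_map_set_val(hctx, "demo_key", &val);
  if (r < 0)
    return r;

  // A straight data-payload access is the problem case Jason raises: on an
  // EC pool no single shard holds the whole object, so a synchronous
  // cls_cxx_read() can't be satisfied in place:
  //
  //   bufferlist data;
  //   r = cls_cxx_read(hctx, 0, 512, &data);

  return 0;
}

CLS_INIT(omap_demo)
{
  cls_handle_t h_class;
  cls_method_handle_t h_omap_touch;
  cls_register("omap_demo", &h_class);
  cls_register_cxx_method(h_class, "omap_touch",
                          CLS_METHOD_RD | CLS_METHOD_WR,
                          omap_touch, &h_omap_touch);
}

Anything that actually needs that commented-out read would presumably
still have to go async or stay restricted to replicated pools.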
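
From the client side the minimum-viable scheme shouldn't change anything:
an omap write is just a normal write op.  A minimal librados sketch of
that follows (pool and object names are made up, error handling trimmed);
today I'd expect this to fail with -EOPNOTSUPP on an EC pool.

// build with: g++ omap_demo.cc -lrados
#include <rados/librados.hpp>
#include <map>
#include <string>

int main()
{
  librados::Rados cluster;
  cluster.init("admin");                  // client.admin, default keyring
  cluster.conf_read_file(nullptr);        // /etc/ceph/ceph.conf
  cluster.connect();

  librados::IoCtx ioctx;
  cluster.ioctx_create("ecpool", ioctx);  // assumed EC pool name

  std::map<std::string, librados::bufferlist> kv;
  kv["schema_version"].append("3");

  librados::ObjectWriteOperation op;
  op.omap_set(kv);
  int r = ioctx.operate("demo_obj", &op); // -EOPNOTSUPP on an EC pool today

  cluster.shutdown();
  return r == 0 ? 0 : 1;
}

With omap replicated on m+1 shards the same call would just start
succeeding, with the fan-out and anointed-shard persistence handled
entirely inside the OSD.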