From: Matt Benjamin <mbenjami@redhat.com>
To: Sage Weil <sweil@redhat.com>
Cc: Gregory Farnum <gfarnum@redhat.com>,
	Josh Durgin <jdurgin@redhat.com>,
	Xie Xingguo <xie.xingguo@zte.com.cn>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Why does Erasure-pool not support omap?
Date: Thu, 26 Oct 2017 11:07:22 -0400
Message-ID: <CAKOnarm2DSyKvTLKLsqnUZOOGsXz3rvpyZnjwEWR6LQDmsHs+A@mail.gmail.com>
In-Reply-To: <alpine.DEB.2.11.1710261423200.22592@piezo.us.to>

That sounds like a promising way forward, to me.

Matt

On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>> >>
>> >> Hi Xingguo,
>> >>
>> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
>> >>>
>> >>>        I wonder why erasure-coded pools cannot support omap currently.
>> >>>
>> >>>        The simplest way for erasure-coded pools to support omap that I
>> >>> can figure out would be duplicating the omap on every shard.
>> >>>
>> >>>        Is it because it would consume too much space when k + m gets bigger?
>> >>
>> >>
>> >> Right.  There isn't a nontrivial way to actually erasure code it, and
>> >> duplicating on every shard is inefficient.
>> >>
>> >> One reasonableish approach would be to replicate the omap data on m+1
>> >> shards.  But it's a bit of work to implement and nobody has done it.
>> >>
>> >> I can't remember if there were concerns with this approach or it was just
>> >> a matter of time/resources... Josh? Greg?
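
To put rough numbers on the overhead (an illustration, not figures from the
thread): with an 8+3 profile there are 11 shards, so duplicating the omap on
every shard stores 11 copies, while replicating it on only m+1 = 4 shards
still survives the loss of any 3 shards and cuts the omap overhead from 11x
down to 4x.
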
>> >
>> >
>> > It restricts us to erasure codes like Reed-Solomon where a subset of
>> > shards is always updated. I think this is a reasonable trade-off, though;
>> > it's just a matter of implementing it. We haven't written up the required
>> > peering changes, but they did not seem too difficult to implement.
>> >
>> > Some notes on the approach are here - just think of 'replicating omap'
>> > as a partial write to m+1 shards:
>> >
>> > http://pad.ceph.com/p/ec-partial-writes
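
As a rough sketch of the "partial write to m+1 shards" idea (hypothetical
code, not from the Ceph tree; the helper name and the choice of the fixed
prefix of shards 0..m are assumptions for illustration):

    #include <iostream>
    #include <set>

    // Hypothetical sketch: an omap-only write does not touch the
    // erasure-coded data, so it only needs to reach enough shards to
    // survive m failures.  A fixed prefix of m+1 shards is one simple choice.
    std::set<int> shards_for_omap_write(int m) {
      std::set<int> targets;
      for (int shard = 0; shard <= m; ++shard)
        targets.insert(shard);            // m+1 shards survive any m losses
      return targets;
    }

    int main() {
      const int k = 8, m = 3;             // e.g. an 8+3 profile
      std::set<int> targets = shards_for_omap_write(m);
      std::cout << "omap-only write goes to " << targets.size()
                << " of " << (k + m) << " shards\n";   // 4 of 11
    }

A full-stripe data write would still go to all k+m shards; only the omap
portion is confined to the fixed subset.
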
>>
>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>> consider the minimum and appropriate number of copies — and the actual
>> shard placement — for local recovery codes. :/ We were unable to
>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>
>> I'm also still nervous that this might do weird things to our recovery
>> and availability patterns in more complex failure cases, but I don't
>> have any concrete issues.
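
For context (an illustration, not from the thread): locally repairable codes
such as Ceph's lrc plugin rebuild a lost shard from a small local group of
shards rather than from any k of them, so the number of omap copies needed
and the shards they must sit on depend on the group layout; there is no
single fixed m+1 subset that plays the role it does for Reed-Solomon.
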
>
> It seems like the minimum-viable variation of this is that we don't change
> any of the peering or logging behavior at all: we just send the omap writes
> to all shards (like any other write), and only the anointed shards persist
> them.
>
> That leaves lots of room for improvement, but it makes the feature work
> without many changes, and means we can drop the specialness around rbd
> images in EC pools.
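
A minimal sketch of that variation (hypothetical code, not Ceph's
implementation; the transaction shape and the rule that shards 0..m are the
"anointed" ones are assumptions): every shard receives the full transaction,
but only the designated shards keep the omap portion.

    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical transaction shape: one EC data chunk plus omap updates.
    struct Txn {
      std::string data_chunk;                                 // this shard's chunk
      std::vector<std::pair<std::string, std::string>> omap;  // key/value updates
    };

    // Every shard gets the same transaction; only the anointed shards
    // (assumed here to be the fixed prefix 0..m) persist the omap portion,
    // giving m+1 copies, enough to survive the loss of any m shards.
    void apply_on_shard(int shard_id, int m, const Txn& t) {
      std::cout << "shard " << shard_id << ": write data chunk";
      if (shard_id <= m)
        std::cout << " + " << t.omap.size() << " omap update(s)";
      std::cout << "\n";
    }

    int main() {
      const int k = 8, m = 3;
      Txn t{"chunk-bytes", {{"key1", "v1"}, {"key2", "v2"}}};
      for (int shard = 0; shard < k + m; ++shard)
        apply_on_shard(shard, m, t);
    }
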
>
> Then, to avoid user mistakes, we can make CephFS and RGW warn about (or even
> refuse) using EC pools for their metadata or index pools, since that's
> strictly less efficient than a replicated pool.
>
> ?
>
> sage



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
