From: Sage Weil
Subject: Re: Why does Erasure-pool not support omap?
Date: Thu, 26 Oct 2017 16:10:38 +0000 (UTC)
To: dillaman@redhat.com
Cc: Gregory Farnum, Josh Durgin, Xie Xingguo, ceph-devel

On Thu, 26 Oct 2017, Jason Dillaman wrote:
> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil wrote:
> > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin wrote:
> >> > On 10/25/2017 05:16 AM, Sage Weil wrote:
> >> >>
> >> >> Hi Xingguo,
> >> >>
> >> >> On Wed, 25 Oct 2017, xie.xingguo@zte.com.cn wrote:
> >> >>>
> >> >>> I wonder why erasure-pools cannot support omap currently.
> >> >>>
> >> >>> The simplest way for erasure-pools to support omap that I can
> >> >>> figure out would be duplicating the omap on every shard.
> >> >>>
> >> >>> Is it because that consumes too much space when k + m gets bigger?
> >> >>
> >> >> Right.  There isn't a nontrivial way to actually erasure-code it, and
> >> >> duplicating it on every shard is inefficient.
> >> >>
> >> >> One reasonable-ish approach would be to replicate the omap data on
> >> >> m+1 shards.  But it's a bit of work to implement and nobody has done
> >> >> it.
> >> >>
> >> >> I can't remember whether there were concerns with this approach or it
> >> >> was just a matter of time/resources... Josh?  Greg?
> >> >
> >> > It restricts us to erasure codes like Reed-Solomon, where a subset of
> >> > shards is always updated.  I think this is a reasonable trade-off,
> >> > though; it's just a matter of implementing it.  We haven't written up
> >> > the required peering changes, but they did not seem too difficult to
> >> > implement.
> >> >
> >> > Some notes on the approach are here - just think of 'replicating omap'
> >> > as a partial write to m+1 shards:
> >> >
> >> > http://pad.ceph.com/p/ec-partial-writes
> >>
> >> Yeah.  To expand a bit on why this only works for Reed-Solomon,
> >> consider the minimum and appropriate number of copies (and the actual
> >> shard placement) for local recovery codes. :/  We were unable to
> >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
> >>
> >> I'm also still nervous that this might do weird things to our recovery
> >> and availability patterns in more complex failure cases, but I don't
> >> have any concrete issues.
> >
> > It seems like the minimum-viable variation of this is that we don't
> > change any of the peering or logging behavior at all, but just send the
> > omap writes to all shards (like any other write), and only the anointed
> > shards persist them.
> >
> > That leaves lots of room for improvement, but it makes the feature work
> > without many changes, and means we can drop the specialness around rbd
> > images in EC pools.
>
> Potentially a negative, since RBD relies heavily on class methods.
> Assuming the cls_cxx_map_XYZ operations will never require async work,
> there is still the issue of methods that perform straight read/write
> calls.

Ooooh right, I remember now.  This can be avoided most of the time by
making the +1 of m+1 be the (normal) primary, but in a degraded situation
the primary might be a shard that doesn't have a copy of the omap at all,
in which case a simple omap read would need to be async.

...and perhaps this too can be avoided by making the primary role always
be one of the m+1 shards that have a copy of the omap data.

I think cls_rgw is in the same boat as cls_rbd, in that it is attr- and
omap-only and doesn't use the object data payload of the index objects?

sage
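
For reference, the kind of class method in question looks roughly like the
sketch below.  This is not cls_rbd or cls_rgw code; the class and method
names are made up and the objclass calls are written from memory of
objclass.h, so treat it as illustrative.  It sticks to omap operations
(the pattern that stays viable if omap is replicated on m+1 shards), and
the commented-out cls_cxx_read() marks the straight data-payload access
that would remain a problem on an EC pool.

// Toy object class in the style of Ceph's objclass API; illustrative only.
#include "objclass/objclass.h"

CLS_VER(1,0)
CLS_NAME(omap_demo)

static int omap_touch(cls_method_context_t hctx, bufferlist *in,
                      bufferlist *out)
{
  // omap read: stays a local, synchronous call only if the acting
  // primary is one of the m+1 shards holding the replicated omap.
  bufferlist cur;
  int r = cls_cxx_map_get_val(hctx, "demo_key", &cur);
  if (r < 0 && r != -ENOENT)
    return r;

  // omap write: under the scheme discussed above it would be sent to all
  // shards like any other write, with only the m+1 omap shards persisting it.
  bufferlist val;
  val.append("demo_value");   // placeholder payload
  r = cls_cxx_map_set_val(hctx, "demo_key", &val);
  if (r < 0)
    return r;

  // A straight data-payload access is the problem case Jason raises: on an
  // EC pool no single shard holds the whole object, so a synchronous
  // cls_cxx_read() can't be satisfied in place:
  //
  //   bufferlist data;
  //   r = cls_cxx_read(hctx, 0, 512, &data);

  return 0;
}

CLS_INIT(omap_demo)
{
  cls_handle_t h_class;
  cls_method_handle_t h_omap_touch;
  cls_register("omap_demo", &h_class);
  cls_register_cxx_method(h_class, "omap_touch",
                          CLS_METHOD_RD | CLS_METHOD_WR,
                          omap_touch, &h_omap_touch);
}

Anything that actually needs that commented-out read would presumably
still have to go async or stay restricted to replicated pools.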
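
From the client side the minimum-viable scheme shouldn't change anything:
an omap write is just a normal write op.  A minimal librados sketch of
that follows (pool and object names are made up, error handling trimmed);
today I'd expect this to fail with -EOPNOTSUPP on an EC pool.

// build with: g++ omap_demo.cc -lrados
#include <rados/librados.hpp>
#include <map>
#include <string>

int main()
{
  librados::Rados cluster;
  cluster.init("admin");                  // client.admin, default keyring
  cluster.conf_read_file(nullptr);        // /etc/ceph/ceph.conf
  cluster.connect();

  librados::IoCtx ioctx;
  cluster.ioctx_create("ecpool", ioctx);  // assumed EC pool name

  std::map<std::string, librados::bufferlist> kv;
  kv["schema_version"].append("3");

  librados::ObjectWriteOperation op;
  op.omap_set(kv);
  int r = ioctx.operate("demo_obj", &op); // -EOPNOTSUPP on an EC pool today

  cluster.shutdown();
  return r == 0 ? 0 : 1;
}

With omap replicated on m+1 shards the same call would just start
succeeding, with the fan-out and anointed-shard persistence handled
entirely inside the OSD.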