From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sage Weil
Subject: Re: bluestore blobs REVISITED
Date: Wed, 24 Aug 2016 22:29:45 +0000 (UTC)
Message-ID:
References: <2F51EC8C-D280-4DFF-8FF6-4AC97071A7D5@sandisk.com> <35D6E177-5070-4853-8CF4-FBDA14B08CAD@sandisk.com>, <67930A6B-070C-4C81-92D8-351AE28D5890@sandisk.com>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:56982 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756534AbcHXWii (ORCPT ); Wed, 24 Aug 2016 18:38:38 -0400
In-Reply-To: <67930A6B-070C-4C81-92D8-351AE28D5890@sandisk.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Allen Samuels
Cc: "jdurgin@redhat.com" , "dillaman@redhat.com" , "ceph-devel@vger.kernel.org"

On Wed, 24 Aug 2016, Allen Samuels wrote:
> Yikes. You mean that blob ids are escaping the environment of the
> lextent table. That's scary. What is the key for this cache? We probably
> need to invalidate it or something.

I mean that there will no longer be blob ids (except within the encoding
of a particular extent map shard). Which means that when you write to A,
clone A->B, and then read B, B's blob will no longer be the same as A's
blob (as it is now in the bnode, or would have been with the -blobwise
branch) and the cache won't be preserved.

Which I *think* is okay...?

sage

>
> Sent from my iPhone. Please excuse all typos and autocorrects.
>
> > On Aug 24, 2016, at 5:18 PM, Sage Weil wrote:
> >
> > On Wed, 24 Aug 2016, Allen Samuels wrote:
> >>> In that case, we should focus instead on sharing the ref_map *only* and
> >>> always inline the forward pointers for the blob. This is closer to what
> >>> we were originally doing with the enode. In fact, we could go back to the
> >>> enode approach where it's just a big extent_ref_map and only used to defer
> >>> deallocations until all refs are retired. The blob is then more ephemeral
> >>> (local to the onode, immutable copy if cloned), and we can more easily
> >>> rejigger how we store it.
> >>>
> >>> We'd still have a "ref map" type structure for the blob, but it would only
> >>> be used for counting the lextents that reference it, and we can
> >>> dynamically build it when we load the extent map. If we impose the
> >>> restriction that, whatever map sharding approach we take, we never share
> >>> a blob across a shard, then the blobs are always local and "ephemeral"
> >>> in the sense we've been talking about. The only downside here, I think,
> >>> is that the write path needs to be smart enough to not create any new blob
> >>> that spans whatever the current map sharding is (or, alternatively,
> >>> trigger a resharding if it does so).
> >>
> >> Not just a resharding but also a possible decompress/recompress cycle.
> >
> > Yeah.
> >
> > Oh, the other consequence of this is that we lose the unified blob-wise
> > cache behavior we added a while back. That means that if you write a
> > bunch of data to a rbd data object, then clone it, then read the clone,
> > it'll re-read the data from disk. Because it'll be a different blob in
> > memory (since we'll be making a copy of the metadata etc).
> >
> > Josh, Jason, do you have a sense of whether that really matters? The
> > common case is probably someone who creates a snapshot and then backs it
> > up, but it's going to be reading gobs of cold data off disk anyway so I'm
> > guessing it doesn't matter that a bit of warm data that just preceded the
> > snapshot gets re-read.
> >
> > sage
> >
> >
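
For anyone following along, here is a rough toy model of the scheme discussed
in this thread. It is only a sketch in Python, not BlueStore code: every name
in it (LExtent, Blob, SharedRefMap, build_blob_refs, spans_shard_boundary) is
made up for illustration, and the fixed-size shard check is an assumption,
since the thread leaves the sharding approach open. It shows the three pieces:
a shared ref map used only to defer deallocation, per-blob lextent counts
rebuilt when a shard is loaded, and a clone that copies blob metadata and
therefore starts with a cold cache.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LExtent:
    """Hypothetical logical extent; blob_id is only meaningful inside one shard."""
    logical_offset: int
    length: int
    blob_id: int


class Blob:
    """Hypothetical onode-local blob with inlined forward pointers."""

    def __init__(self, pextents):
        self.pextents = tuple(pextents)  # (disk_offset, length) pairs
        self.cache = {}                  # per-blob buffer cache (assumption)

    def clone(self):
        # A clone copies the metadata only, so the new blob has an empty cache.
        return Blob(self.pextents)


class SharedRefMap:
    """Hypothetical shared extent ref map, used only to defer deallocation."""

    def __init__(self):
        self.refs = Counter()

    def get(self, pextent):
        self.refs[pextent] += 1

    def put(self, pextent):
        """Drop one ref; return True when the extent can be deallocated."""
        self.refs[pextent] -= 1
        if self.refs[pextent] == 0:
            del self.refs[pextent]
            return True
        return False


def build_blob_refs(shard_extents):
    """Rebuild per-blob lextent counts when a shard of the extent map is loaded."""
    return Counter(e.blob_id for e in shard_extents)


def spans_shard_boundary(offset, length, shard_size):
    """True if [offset, offset + length) crosses a shard edge (fixed-size shards assumed)."""
    if length <= 0:
        return False
    return offset // shard_size != (offset + length - 1) // shard_size
```

In this model the write path would call spans_shard_boundary() before creating
a blob and either split the write or reshard when it returns True. Reading a
clone misses the cache because clone() hands back a new Blob whose cache dict
is empty, which is the behavior change asked about above.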