From: Sage Weil
Subject: RE: Bluestore different allocator performance Vs FileStore
Date: Thu, 11 Aug 2016 21:19:49 +0000 (UTC)
To: Allen Samuels
Cc: Ramesh Chander, Somnath Roy, ceph-devel

On Thu, 11 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, August 11, 2016 1:24 PM
> > To: Allen Samuels
> > Cc: Ramesh Chander; Somnath Roy; ceph-devel
> > Subject: RE: Bluestore different allocator performance Vs FileStore
> >
> > On Thu, 11 Aug 2016, Allen Samuels wrote:
> > > Perhaps my understanding of BlueFS is incorrect -- so please
> > > clarify as needed.
> > >
> > > I thought that the authoritative indication of space used by BlueFS
> > > was contained in the snapshot/journal of BlueFS itself, NOT in the
> > > KV store. This requires that upon startup we replay the BlueFS
> > > snapshot/journal into the FreeListManager so that it properly
> > > records the consumption of BlueFS space (since that allocation MAY
> > > NOT be accurate within the FreeListManager itself). But that
> > > playback need not generate any KVStore operations (since those are
> > > duplicates of the BlueFS ones).
> > >
> > > So in the code you cite:
> > >
> > >    fm->allocate(0, reserved, t);
> > >
> > > there's no need to commit 't', and in fact, in the general case,
> > > you don't want to commit 't'.
> > >
> > > That suggests to me that a version of allocate that doesn't take a
> > > transaction could easily be created and would have the speed we're
> > > looking for (and independence from the BitMapAllocator-to-KVStore
> > > chunking).
> >
> > Oh, I see. Yeah, you're right--this step isn't really necessary, as
> > long as we ensure that the auxiliary representation of what bluefs
> > owns (bluefs_extents in the superblock) is still passed into the
> > Allocator during initialization. Having the freelist tell the
> > allocator that this space was "in use" (by bluefs) and thus off
> > limits to bluestore is simple but not strictly necessary.
> >
> > I'll work on a PR that avoids this...

https://github.com/ceph/ceph/pull/10698

Ramesh, can you give it a try?

> > > I suspect that we also have long startup times because we're doing
> > > the same underlying bitmap operations, except they come from the
> > > BlueFS replay code instead of the BlueFS initialization code --
> > > same problem, with likely the same fix.
> >
> > BlueFS doesn't touch the FreelistManager (or explicitly persist the
> > freelist at all)... we initialize the in-memory Allocator state from
> > the metadata in the bluefs log. I think we should be fine on this
> > end.
>
> Likely that code suffers from the same problem -- a false need to
> update the KV store. (During the playback, BlueFS extents are
> converted to bitmap runs; it's essentially the same lower-level code
> as the case we're seeing now, but instead of being driven by an
> artificial "big run", it'll be driven from the BlueFS journal replay
> code.) But that's just a guess; I don't have time to track down the
> actual code right now.

BlueFS can't touch the freelist (or kv store, ever) since it ultimately
backs the kv store and that would be problematic. We do initialize the
bluefs Allocator's in-memory state, but that's it.
The PR above changes BlueStore::_init_alloc() so that BlueStore's
Allocator state is initialized with both the freelist state (from the kv
store) *and* the bluefs_extents list (from the bluestore superblock).
(From this Allocator's perspective, all of BlueFS's space is allocated
and can't be used. BlueFS has its own separate Allocator instance for
its internal allocations.)

sage